DevSecOps: Levelling up your Terraform with tfsec and Terratest

Introduction

Automated security and compliance testing is becoming an increasingly common part of agile DevOps approaches to software development. Without automated testing, managing the sheer number of dependencies in modern software projects becomes impractical.

The rate of change within cloud platforms means that infrastructure-as-code (IaC) projects should also do this sort of checking. Fortunately, there are now IaC-specific tools available for the job. tfsec is one such tool.

In this post I will show how you can use tfsec to run security checks against your Terraform code, and address any issues that it finds. Furthermore, we’ll use the Terratest Go library to make sure we don’t accidentally cause any functional regressions along the way. I’ll use a sample project to demonstrate.

Requirements and assumptions

To follow along here you should have a basic understanding of:

  • Terraform
  • AWS
  • Golang

I won’t cover how to connect to AWS from the CLI, or how the Terraform AWS provider is used to plan, apply and destroy infrastructure stacks.

The Demo Project

The demo project is very basic, and very much not production-ready. It is, however, enough for the purposes of our demo. It’s a web service that does two things:

  1. Accepts POST requests with image metadata, and
  2. Returns a list of all image metadata in our database

For storage, the service stores rows of data in a SQLite database, which is kept on the same EC2 instance as the service. The server also runs Litestream to stream the SQLite database to AWS S3, so the database can be recovered in the event of an instance failure or refresh.

The project code comprises two main parts: the code for the service itself, and code for setting up the infrastructure to host that service.

Service Code

The service code contains the following files:

  • main.go: the code for the web service
  • demoservice.service: a systemd service file for running our service
  • createTable.sql: A SQL file to create a table in a SQLite database

Infrastructure Code

The infrastructure code is broken into three main directories:

  • application: Creates the infrastructure for our service and deploys our service binary. Our litestream.yml configuration file lives here as well.
  • test: Our Terratest files
  • account: Creates the VPC, subnets and internet gateway required. I won’t focus much on this in this article.

You will also notice that we have multiple versions of both the application and test folders, representing how they evolve as we refactor our code.

In the initial application folder we create the following components:

  1. A single EC2 instance, including user_data for installing and configuring our service on boot (see below for more information on this)
  2. A network interface and elastic IP that provide the public IP address for the instance
  3. A security group attached to the network interface, which controls traffic to our instance
  4. An IAM policy and role to provide the permissions our instance needs
  5. An S3 bucket for storing our application, configuration files and Litestream backups

The general flow for installing the service on the EC2 instance (via user_data) is:

1) Install SQLite and the AWS CLI on the instance
2) Download and install Litestream
3) Download and install our demo service from S3
4) Check to see if a dbcreated.flag file exists in S3. If it doesn’t, then create a new SQLite database locally and create the flag in S3
5) Enable and start Litestream and our service

To reiterate the point I made at the start of this section: this application does not reflect best practices for an enterprise service. For example, there is no provision for high availability or scaling. However, it’s sufficient for what we’ll be covering in this post.

Our initial tests

Let’s take a look into our initial Terratest file. For starters, we need to ensure our infrastructure names are unique. In regular usage, we do this via an environment variable that is passed in on stack creation. For our testing, we will generate a random string:

serviceEnvironment := strings.ToLower(random.UniqueId())

and include it as part of the set of options that are passed in when creating our stack:

terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
	// Set the path to the Terraform code that will be tested.
	TerraformDir: "../application",
	Vars: map[string]interface{}{
		"environment": serviceEnvironment,
		"vpc_name": "terraform-tools-demo-dev",
		"subnet_name": "terraform-tools-demo-dev-subnet",
		"stack_name": "terraform-tools-demo",
		"aws_region": awsRegion,
	},
})

Note that we are hardcoding a subnet and VPC name. These will have been previously created by the code in the account folder. If you wanted to, you could create the VPC and subnet with each test. However, in this case it’s outside the scope of what we want to test, so we’ll stick with the hardcoded values.

Next up, we use Golang’s defer keyword to define a couple of steps that we want to ensure happen when our test has finished running (or if it exits unexpectedly).

defer is a last-in-first-out keyword, so in this case our test code will first empty our S3 bucket and then destroy the infrastructure created by the test:

// Clean up resources with "terraform destroy" at the end of the test.
defer terraform.Destroy(t, terraformOptions)

// Run "terraform init" and "terraform apply". Fail the test if there are any errors.
terraform.InitAndApply(t, terraformOptions)

// Empty and cleanup bucket
s3Bucket := terraform.Output(t, terraformOptions, "service_bucket")
defer aws.EmptyS3Bucket(t, awsRegion, s3Bucket)

A couple of points worth understanding if you are new to Terratest:

  • terraform.InitAndApply takes our defined options and creates our stack
  • terraform.Output retrieves any options defined as outputs from our Terraform code

Next, we’ll set up a few variables:

publicIp := terraform.Output(t, terraformOptions, "instance_public_ip")
url := fmt.Sprintf("http://%s:8080/images", publicIp)
testBody := "{\"FileName\": \"test.png\",\"Description\": \"This is my test image\"}"

Where:

  • publicIp will contain the public IP address of the elastic IP attached to our instance
  • url uses this public address to form the URL we will send HTTP requests to
  • testBody contains the raw JSON string that we will POST to our service to create a row in our database.

Now we have everything we need to send our POST request to our service:

http_helper.HTTPDoWithRetry(t, 
  "POST", 
  url, 
  []byte(testBody), 
  map[string]string{"Content-Type": "application/json"}, 
  200, 
  30, 
  5*time.Second, 
  nil
)

Where:

  • t is a standard test parameter passed to Golang test methods (t *testing.T). It manages the state of our test execution.
  • "POST" is the HTTP method we will use
  • url is the URL we will send the request to
  • []byte(testBody) is our raw JSON string, cast to an array of bytes
  • map[string]string{"Content-Type": "application/json"} is a map with our desired headers
  • 200 is the expected HTTP response (for a successful test)
  • 30 is the number of times to attempt our POST request before failing the test
  • 5*time.Second is the amount of time to wait between requests (five seconds)
  • nil represents our TLS config, which is undefined in this case, as we are using HTTP

So essentially, this test will attempt to send the request up to thirty times, at five second intervals. As soon as it receives a 200 HTTP response, the test will pass.

The next test will send a GET request to our API. We will then check that the expected number of items were returned, and that their values are correct. This will ensure our previous POST request resulted in the correct values having been stored in the database.

I won’t include all the code here, as the details of how we unmarshal our JSON into a map are not really relevant to this post (if you’re interested you are welcome to look at the test source). However, there is one important bit worth understanding:

// Get our image metadata
_, body := http_helper.HttpGet(t, url, nil)

This method sends a HTTP GET request to the API. It returns two values: statusCode and body. We don’t care about the statusCode here, so we ignore it by assigning it to an unused _ value. Our response is stored in the body variable.

From here we use the json.Unmarshal() method to unmarshal the raw response into a variable called mp, which is a slice of maps (matching what our API returns).
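
For reference, the unmarshalling itself boils down to something like the following. This is a minimal sketch, assuming the encoding/json package is imported; the actual test code may differ slightly.

// Unmarshal the JSON array returned by the API into a slice of maps.
var mp []map[string]interface{}
if err := json.Unmarshal([]byte(body), &mp); err != nil {
	t.Fatalf("failed to unmarshal response body: %v", err)
}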

We can now check our array contains the value we posted in the first test:

// Get number of elements in our JSON array response
numElements := len(mp)
if (numElements != 1) {
	t.Logf("Expected 1 element returned, got: %d", numElements)
	t.FailNow()
}

Note that we expect exactly one element in our JSON response; if we get anything else, we log it and fail the test immediately.

Finally, if this test passes, then we will just make sure the "FileName" and "Description" fields in the response match what we are expecting:

// Make sure the values coming back in the response are correct
fileName := mp[0]["FileName"]
description := mp[0]["Description"]

assert.Equal(t, "test.png", fileName)
assert.Equal(t, "This is my test image", description)

Now let’s run our test and see what the output from Terratest looks like. To do this, we change to the infra/test directory in our terminal and run:

> go test

I won’t include all of the output here as it’s very verbose. However, it will look something like this:

Firstly, our Terraform init and apply completes. We can see thirteen resources have been created, and our two outputs have been returned. Take note of the random environment identifier – in this case, gf8ole.

Next, once our infrastructure is created, we will see multiple failed POST requests while our instance is starting up. Finally, if/when the service does actually come up, we will see a POST and GET request in quick succession:

Assuming there are no immediate failures, our S3 bucket will then be emptied, and our stack destroyed:

At the end of our test run, we get a short report on how the tests went:

In this case, they all passed, so we can be confident that our application is working as expected!

Now that we have a way to easily verify that any further changes we might make won’t break anything, let’s move on to using tfsec to check how good our Terraform code is from a security perspective.

Running tfsec for the first time

To run tfsec against the initial iteration of our code, we can simply change into the infra/application directory and run:

> tfsec

The output will look something like this:

You can see there’s some red text in there (never a good thing) and a summary that advises we have eighteen potential problems!

Fixing up the low-hanging fruit

Rather than jumping in and trying to fix everything at once, let’s start with some of the easier-to-fix issues, starting with this critical-severity result:

Ingress from the public internet

Typically in a production scenario, you would never expose an instance directly like this. However, in the case of our demo we have deliberately allowed public internet access to the instance and are willing to accept the risk. So how do we tell tfsec to ignore this issue?

tfsec provides a simple comment-based mechanism for doing this. To activate it, we append a comment to the problematic line of code that violated the rule in the first place. In this case, the comment would appear in the code as follows:
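
For example, if the finding was triggered by the ingress rule of our security group, the suppression might look something like the sketch below. The resource name and ports are illustrative, and the rule ID (here assumed to be aws-vpc-no-public-ingress-sgr) should be taken from tfsec's own output, as it can vary between versions.

resource "aws_security_group" "demo_service" {
  # ... other arguments omitted ...

  ingress {
    description = "Allow inbound traffic to our service"
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    # Deliberately open to the internet for this demo, so suppress the finding
    #tfsec:ignore:aws-vpc-no-public-ingress-sgr
    cidr_blocks = ["0.0.0.0/0"]
  }
}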

Note how the comment is of the form #tfsec:ignore:<rule>, where <rule> is the name of the rule that we want to suppress.

Public S3 Buckets

The next result from tfsec that we’ll look into relates to our S3 bucket:

Note that this result includes a helpful “More Information” section. In this case, the section contains links to both the tfsec rule details and to the relevant Terraform documentation. This saves us having to dig around in the documentation ourselves.

This result relates to an S3 bucket-level configuration. As it turns out, there are three other related issues in our results:

Result #7 HIGH No public access block so not blocking public acls
Result #8 HIGH No public access block so not blocking public policies
Result #11 HIGH No public access block so not restricting public buckets

We definitely have no need for public access in any of these cases, so let’s go ahead and tighten it up, all in one hit:
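
A sketch of what that could look like, assuming our bucket resource is named aws_s3_bucket.service_bucket (the actual resource name in the repo may differ):

resource "aws_s3_bucket_public_access_block" "service_bucket" {
  bucket = aws_s3_bucket.service_bucket.id

  # Block all forms of public access to the bucket
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}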

Unencrypted data

Another high-priority result is:

Result #9 HIGH Bucket does not have encryption enabled

We’ll revisit this in more detail later, but for now we’ll just enable server-side encryption of our S3 bucket, using our default KMS key:
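
Depending on the AWS provider version, this can be done either with an in-bucket block along the lines of the sketch below, or with the standalone aws_s3_bucket_server_side_encryption_configuration resource. The bucket resource name here is illustrative.

resource "aws_s3_bucket" "service_bucket" {
  # ... other arguments omitted ...

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        # No kms_master_key_id is specified, so the default aws/s3 key is used
        sse_algorithm = "aws:kms"
      }
    }
  }
}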

Similarly, our instance root disk isn’t encrypted, so we’ll enable that as well:
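
That change is a one-liner in the instance's root_block_device block (instance resource name illustrative):

resource "aws_instance" "demo_service" {
  # ... other arguments omitted ...

  root_block_device {
    # Encrypt the root EBS volume
    encrypted = true
  }
}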

Accessing IMDS without a token

Next, we see a result related to the instance metadata service:

Result #13 HIGH Instance does not require IMDS access to require a token

This is exactly the kind of issue that an infrastructure engineer might accidentally overlook. IMDS (Instance Metadata Service) is a service available to all instances that allows a user or administrator to get information about the instance. It is also critical to the process by which instances get temporary credentials when using an IAM role.

AWS introduced a new version of this service after a number of high profile incidents where misconfigured or vulnerable applications inadvertently exposed these temporary credentials to attackers. One of the mitigations enabled by the new version is to require applications to obtain a temporary session token before making a request to the service, thus minimising the risk of accidental exposure. This is probably best explained by the documentation for the tfsec rule:

IMDS v2 (Instance Metadata Service) introduced session authentication tokens which improve security when talking to IMDS. By default aws_instance resource sets IMDS session auth tokens to be optional. To fully protect IMDS you need to enable session tokens by using metadata_options block and its http_tokens variable set to required.

To rectify the problem, we update our instance configuration as follows:
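
Roughly speaking, the fix amounts to adding a metadata_options block to the instance, as sketched below (instance resource name illustrative):

resource "aws_instance" "demo_service" {
  # ... other arguments omitted ...

  metadata_options {
    http_endpoint = "enabled"
    # Require IMDSv2 session tokens for all metadata requests
    http_tokens   = "required"
  }
}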

Testing our changes

You can see the version of the code that contains the changes so far here. I also addressed a few other of the more minor issues that were raised, but won’t describe them in this post for the sake of brevity.

When we run tfsec again, we can see we’re down to 3 potential problems:

However, before continuing it would also be a good idea to make sure we haven’t accidentally broken anything. Normally we’d have to do this by applying our changes, manually testing the application, and digging around in the instance logs looking for anything suspicious. However, because we’re using Terratest, we can instead just run go test again. In this case, the result is:

Looks like everything has passed! We can now confidently move on with fixing the remaining issues.

Tackling IAM policies

Of the three remaining results, #1 and #2 are similar:

HIGH IAM policy document uses wildcarded action 's3:*'

IAM policies are notoriously difficult to get right. People often resort to using wildcard rules just to get things working. While that may save them time in the short term, it also opens them up to serious issues in the long term if they don’t take the time to go back later and tighten things up.

In these cases, tfsec is telling us that if our application or instance is compromised, then an attacker can do anything they want with all of our S3 buckets.

Let’s try and limit our access to just our application bucket and lock down those permissions a bit:
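
The narrowed-down policy looks something like the sketch below. The policy document and bucket names are illustrative, and keep an eye on the action names:

data "aws_iam_policy_document" "instance_policy" {
  statement {
    effect = "Allow"
    actions = [
      "s3:ListBucket",
      "s3:GetObjects", # note this action name, it becomes important shortly
    ]
    resources = [
      aws_s3_bucket.service_bucket.arn,
      "${aws_s3_bucket.service_bucket.arn}/*",
    ]
  }
}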

Problem solved, right? Well, to be 100% sure, let’s quickly run our tests and make sure the updated version of our application is still working:

Oh no! It looks like our change has broken our application!

If we take a closer look at the policy, we can see the issue. We specified the action s3:GetObjects, but it should actually be s3:GetObject. This is an easy mistake to make, but fortunately it’s also easy to fix. With that done, let’s once again run Terratest on the updated version of the app:

Looks like that fixed it and we’re working again.

Bucket Encryption, Second Attempt

Let’s now look at our third and final result from tfsec:

Result #3 HIGH Bucket does not encrypt data with a customer-managed key.

It looks like tfsec has a problem with the way in which we dealt with the earlier issue regarding unencrypted data. Specifically, rather than using the default key in the account, this rule is suggesting we create a KMS key as part of our stack and use that to encrypt our bucket contents. Let’s do that and update our bucket encryption configuration:
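
A sketch of that change, with illustrative resource names:

resource "aws_kms_key" "service_bucket_key" {
  description         = "Customer-managed key for the demo service bucket"
  enable_key_rotation = true
}

resource "aws_s3_bucket" "service_bucket" {
  # ... other arguments omitted ...

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm     = "aws:kms"
        # Encrypt with our new customer-managed key instead of the default aws/s3 key
        kms_master_key_id = aws_kms_key.service_bucket_key.arn
      }
    }
  }
}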

One last thing

Having shifted to using a new KMS key, the question arises: are our tests actually capable of verifying that change? On reflection, our database creation and Litestream backup process aren’t covered by the tests so far. Yet both require write access to S3, as well as KMS permissions to encrypt objects with our new key. Will what I’ve done actually work?

To find out, let’s create a new version of our test:
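
The essence of the new check is only a couple of lines. This is a sketch that assumes the flag object lives at the root of the bucket under the key dbcreated.flag, and it reuses the s3Bucket and awsRegion variables and the Terratest aws module already used elsewhere in the test; the real test may differ slightly.

// GetS3ObjectContents fails the test if the object cannot be retrieved,
// so calling it is enough to verify the flag file was written to S3.
flagContents := aws.GetS3ObjectContents(t, awsRegion, s3Bucket, "dbcreated.flag")
t.Logf("dbcreated.flag found with contents: %q", flagContents)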

This new check simply ensures that our dbcreated.flag file is present. For our purposes this is good enough to confirm our instance has permissions to write to our S3 bucket.

Funnily enough, when we run the updated test, we get:

It seems that our IAM permission changes have broken S3 write access for our instance. Lucky we remembered to write a test for it!

To fix it, firstly we’ll allow the instance to get, put and delete objects in our bucket:
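
In the policy document, that translates to a statement along these lines (added to the aws_iam_policy_document sketched earlier; the bucket reference is illustrative):

statement {
  effect = "Allow"
  actions = [
    "s3:GetObject",
    "s3:PutObject",
    "s3:DeleteObject",
  ]
  # Object-level actions apply to the objects within the bucket
  resources = ["${aws_s3_bucket.service_bucket.arn}/*"]
}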

Whilst we’re at it, we’ll also need to add KMS permissions for our key, now that we’re not using the default:
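
A corresponding statement for the key might look like the following sketch. The key reference is illustrative, and the exact set of KMS actions your workload needs may vary:

statement {
  effect = "Allow"
  actions = [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:GenerateDataKey",
  ]
  resources = [aws_kms_key.service_bucket_key.arn]
}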

Running the tests again on the updated version of our application, we get:

Looks like we’ve gotten it back to a working state! Now if we run tfsec a final time, we see:

No problems detected! That’s more like it!

Conclusions

In this post, I’ve introduced two tools that bring practices from general software development to the world of infrastructure-as-code.

I’ve used tfsec to do static analysis and pick up security issues that even an experienced engineer would struggle to notice during a manual review. tfsec has also proved to be a fantastic way to learn about what is considered to be best-practice with Terraform.

I’ve then used Terratest to ensure that my fixes for the issues that tfsec has found don’t cause any functional regressions. And I can see how the iterative, test-driven development approach enabled by Terratest might help me to confidently refactor and/or implement new changes in future.

It’s great to see that techniques we take for granted in other parts of the software development world can now be applied to infrastructure. I encourage you to level up your infrastructure-as-code practice by embracing them.

david.ensikat@shinesolutions.com

DevOps Engineer at Shine Solutions
