Real-world experiences with AWS RDS blue/green deployments

Shine recently took on management of a SaaS product hosted on AWS. Whilst getting up to speed with the application functionality and architecture, I had the task of upgrading the AWS RDS instances underlying the app to a new server type. This was due to a pending deprecation of the existing server types by AWS. In this post I’m going to talk about my real-world experiences using AWS RDS Blue/Green deployments to do this migration.

The cloud is a wonderful thing

I come from a history of system administration, dating back to the days when there were physical servers on premises (I’m showing my age a bit there). Cloud services like AWS were a great leap forward in managing platforms for running software, and I’m on board with anything that makes my job easier. The ability to create another copy of a server (or an entire environment) whenever you need it, and for it to be ready in just a few minutes, is wonderful. It cuts out so many barriers to getting things done that we can go back to concentrating on the really tricky stuff (like Googling how to do string interpolation for the 17th time 🙂 ).

Faced with an intimidating database server upgrade, the recently-introduced fully managed Blue/Green deployments in Amazon RDS seemed like a perfect tool for the task. I won’t go into all the details, but the general idea is that when you have an RDS instance, the tool will take the production DB environment (known as “Blue”) and create a read-replica (clone) of it in a new staging environment (also known as “Green”). This clone is then kept in sync with production by replicating all data updates.

You can then make changes to the staging environment, like upgrading server types or applying security patches. When you are satisfied with the changes, you can switch the servers over in minutes, with minimal downtime and full confidence that no data will be lost.

The documentation is to the usual AWS high standards, including a terrific user guide.

So let’s do this

I had 2 separate databases to upgrade, so I was able to go through the process twice. This was useful as a learning experience, and also to ensure the process was properly documented in our support guidelines.

I followed the instructions for the AWS Management Console, as I like the visual representation the console gives me of the current status. The steps were few and easy to follow, with limited customisable options. The key part was that I specified that the green (replica) instance size be upgraded to m5 – the main reason for doing this upgrade in the first place.
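For reference, the same thing can be done from the AWS CLI. This is a sketch with placeholder identifiers (the deployment name, account ID, and instance name below are made up), and it assumes a CLI version recent enough to support specifying the green environment's instance class at creation time:

```shell
# Create a blue/green deployment for an existing RDS instance.
# 'app-db-upgrade', the ARN, and 'db.m5.large' are placeholders --
# substitute your own deployment name, source instance ARN, and
# target instance class.
aws rds create-blue-green-deployment \
  --blue-green-deployment-name app-db-upgrade \
  --source arn:aws:rds:ap-southeast-2:123456789012:db:app-db-prod \
  --target-db-instance-class db.m5.large
```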

Unfortunately, this first attempt failed after a few minutes, with an error showing up in the console and the replica not being created. It turned out the problem was that I had originally selected an updated patch version of the database engine for the green environment. This ‘should’ be allowed, but it’s not the first time I’ve seen a patch version update break functionality. So, on my second attempt I selected the same version as the original. This time the replica was created, and the two databases were linked in the UI via the ‘Blue/Green Deployment’ line item:

Moving on to the other production database, I went through the same steps. This time everything worked on the first try, and I could see the following in the console (note some parts of the server names have been redacted):
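The same linkage and replica status can also be checked from the CLI rather than the console. A minimal sketch:

```shell
# List blue/green deployments with their current status. The status
# should move from PROVISIONING to AVAILABLE once the green replica
# has been created and is in sync with blue.
aws rds describe-blue-green-deployments \
  --query 'BlueGreenDeployments[].[BlueGreenDeploymentName,Status]' \
  --output table
```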

The switch

After creating backups of the databases, and scheduling a window for the change, I was ready to go. Even though the downtime is supposed to be no more than a few minutes, I started the change after business hours so that I could have time to tackle any problems that might arise.
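For the backup step, a manual snapshot of the blue (production) instance is enough. A sketch, again with placeholder instance and snapshot names:

```shell
# Take a manual snapshot of the production instance before switching.
aws rds create-db-snapshot \
  --db-instance-identifier app-db-prod \
  --db-snapshot-identifier app-db-prod-pre-switchover

# Block until the snapshot is available before proceeding.
aws rds wait db-snapshot-available \
  --db-snapshot-identifier app-db-prod-pre-switchover
```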

Through the console I triggered the switchover, and a couple of minutes later the process was ‘complete’, by which I mean that the application was now pointing to what was previously the staging database server.
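The switchover can equally be triggered from the CLI. The deployment identifier below is a placeholder; the timeout (in seconds) bounds how long RDS will wait for replication to catch up before abandoning the switchover:

```shell
# Trigger the blue/green switchover ('bgd-EXAMPLE123' is a placeholder
# for the deployment identifier reported by the describe command).
aws rds switchover-blue-green-deployment \
  --blue-green-deployment-identifier bgd-EXAMPLE123 \
  --switchover-timeout 300
```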

However, the application would not let me log in. After a bit of digging, I found that the database in use was still in read-only mode. Worse, the databases had lost their ‘link’ via the ‘Blue/Green Deployment’ line item, so there wasn’t an option to switch back:

The ‘-old1’ naming convention felt a bit jarring, reminding me of my take on ‘version control’ from days gone by.

It’s not easy being green

Thankfully I had some time up my sleeve to sort it out. That said, Google was not much help. The feature is quite new, so there was not a lot of lived-experience available. However, going back over the documentation, I noticed this at the end of the user guide:

After a switchover, the DB instances in the previous blue environment are retained. Standard costs apply to these resources.
RDS renames the DB instances in the blue environment by appending -oldn to the current resource name, where n is a number. The DB instances are read-only until they are rebooted.

That last line provided a ray of hope, although it seemed strange that it was only mentioned in passing, rather than as an explicit instruction to perform a reboot. However, I thought it was worth a try and gave the instance a kick.
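The ‘kick’ in question, for anyone following along from the CLI (instance name is a placeholder):

```shell
# Reboot the read-only instance, as the user guide hints.
aws rds reboot-db-instance --db-instance-identifier app-db-prod
```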

Still no luck. The database was still read-only. I guess that wasn’t the reason they’d mentioned it in the docs.

Pondering the situation a bit more, I noticed that, although the application was pointing to the new server, that server still seemed to be a replica of the original. I also knew that stopping the primary database should trigger a fail-over to the replica, meaning the replica would then become the primary.

I decided this was worth a shot, and manually stopped the primary myself. The result was… success!
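The manual stop is a one-liner from the CLI; the instance name below is a placeholder for whichever instance was still acting as the replication source after the switchover:

```shell
# Stop the old source instance so the replica takes over as primary.
aws rds stop-db-instance --db-instance-identifier app-db-prod-old1
```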

However, it was important to remember that an instance stopped in this manner will automatically start up again after 7 days. So, after giving it 24 hours in production to make sure everything was in order, I permanently deleted the original database server, leaving only the new one running.
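The final cleanup, sketched with placeholder names; keeping a final snapshot is optional but cheap insurance:

```shell
# Permanently delete the old instance, retaining one last snapshot
# in case anything surfaces later.
aws rds delete-db-instance \
  --db-instance-identifier app-db-prod-old1 \
  --final-db-snapshot-identifier app-db-prod-old1-final
```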

Moving On

After giving myself some time to recover from the stress of the first migration, I moved on to the switch over of database number 2. I had already set up the replica, so it was just a matter of clicking the switchover button.

This time, to my relief, everything worked correctly! The instances were even renamed as expected, with the “blue” instance previously used in production now carrying the -old1 suffix, and “green” now labelled as the ‘Primary’ instance. They were also still logically linked in the display, so that a switch back could be easily executed:

This is the way it should look, if everything goes to plan.

The Washup

Considering the steps I would have otherwise had to perform manually for this process, I was reasonably happy with this new feature, despite the hiccups I encountered along the way. I’ll put those down as early teething problems for a new service, rather than fundamental flaws. Stability should improve over time, and then RDS blue/green deployments can be enjoyed by all without too much drama.

1 Comment
  • Steve
Posted at 03:21h, 09 June

    We had the same / a similar problem where we started getting errors about the database being read-only after the switchover. Our theory is that the IIS connection pool was holding on to connections to the old blue database. We ended up resolving it by rebooting the IIS servers.
