Mismatched plug and socket

Anyone who’s ever had to support server infrastructure of any kind knows the value of having a comprehensive, automated monitoring solution in place. With this in mind, we have begun to roll out the New Relic platform to monitor all our AWS-based servers. New Relic comes with many great monitoring metrics straight out of the box, but still has the flexibility for software developers to create their own plugins for customized metrics on just about anything your users will care about.

Our first task was monitoring our database servers. They reside on the AWS platform and use the Relational Database Service (RDS). By default, AWS provides some fairly good metrics, such as CPU Utilization, that you can review with a monitoring tool.

After some initial research, we found that a New Relic plugin had already been developed that could tap into this information from AWS and post it straight into New Relic. There was even a prepackaged AMI that hosted the plugin. And it was all free, apart from the cost of running a small EC2 instance. How awesome is that! We figured we’d install the plugin and, a few hours later, be completely set up monitoring our RDS instances.

The RDS plugin has two options for installation. The recommended option is a bundled AWS EC2 instance that only requires some custom configuration. The two most important settings are an AWS IAM user and a New Relic licence key. The IAM user allows us to pull back information about our AWS servers, and the New Relic licence key is required to post metrics to New Relic.
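For reference, these two settings live together in the plugin’s configuration file. The exact key names vary between plugin versions, so treat this as an illustrative sketch only:

```yaml
# Illustrative sketch; check the bundled config template for the exact keys.
newrelic:
  license_key: "<your New Relic licence key>"
aws:
  access_key: "<IAM user access key>"   # IAM user with read access to RDS/CloudWatch
  secret_key: "<IAM user secret key>"
```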

The Problem

So the setup went smoothly and, sure enough, metrics started flowing through to our New Relic account. The final step was simply to set up some alarms and we’re done, right? Well, not exactly. Whilst the CPU and DB connections checks worked straight out of the box, the RDS storage space metric was a little off.

A quick investigation found that the plugin was collecting metrics on how much space was free on the server. In itself that’s not a problem. But the New Relic alarms are designed to trigger when the value reported exceeds a given tolerance. The free space metric in the plugin would only ever go down in value over time, so we could never reach a state where we would alarm.

The following charts illustrate the problem. If we set the threshold to 950GB, we will never get an alarm, even though we will eventually run out of space:


If, instead, we set a low threshold of say 50GB:


We always get an alarm until we nearly run out of space entirely. That’s not really helping either.

So it wasn’t so easy after all. There seems to be a mismatch between the way AWS reports free space and the way New Relic alarms work. We tried several little math hacks at first, such as negating the value, but none of these worked. We did manage to hard-code the total RDS storage size to make the math work and get the right value, but this would not be suitable for monitoring multiple RDS instances of different sizes, and the label of the metric in New Relic no longer represented the data presented. Oh yes, and hard-coding magic numbers is never a good idea.

But what our hard-coding test did show us was that, as long as we could get the actual size of the RDS instance, getting the right value into New Relic would be possible. We felt we had 90% of the code we needed in the RDS plugin and, most importantly, it was already making calls to AWS using the RDS SDK.

The Solution

Feeling close to the answer, we pushed on and decided to develop our own plugin. The source code for the existing plugin was free to modify and distribute so we felt that was the right codebase to begin with. This particular plugin was initially developed using Ruby, but the New Relic platform also supports plugins developed using Java and Microsoft .Net.

One of the most basic requirements for creating your own plugin is to give it a GUID (Globally Unique Identifier). As our base plugin was in Ruby, this was a fairly simple change. We also changed the human-readable name of our plugin slightly to avoid any confusion between our plugin and the original.

At this point we hadn’t really changed anything in terms of functionality. A quick test against the New Relic platform showed we were in fact posting the metrics, and given we were now the ‘authors’ of the plugin, we had the ability to configure our own dashboard, give the metrics meaningful titles and even designate the metrics we wanted to alarm on.

So with the plugin posting data to New Relic and the New Relic console allowing us to configure its display, we just needed the actual metric data. We figured that mathematically speaking, the calculation should look like this:

Percentage Used = (Total Space − Free Space) / (Total Space / 100)

We settled on percentage used because the value rises over time, allowing us to configure New Relic to alarm on a threshold like 90%.
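As a quick sanity check, the formula can be expressed directly in Ruby. The numbers below are made up purely for illustration:

```ruby
# Hypothetical values: a 1000 GB instance with 120 GB free.
total_space = 1000.0  # GB of allocated storage
free_space  = 120.0   # GB free, as reported by the free-space metric

percentage_used = (total_space - free_space) / (total_space / 100)
# => 88.0, which can be compared against a 90% alarm threshold
```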

We already had the Free Space metric, so we just needed to programmatically get the total RDS space. To do this, we would need to make a call using the AWS Ruby SDK. Our plugin already had much of the framework for this in place, as it used the SDK to get the list of RDS instances to monitor.

To begin with, we initialized the call to the AWS SDK:

  instanceInfoRds = AWS::RDS.new(
    :access_key_id     => @aws_access_key,
    :secret_access_key => @aws_secret_key,
    :region            => @aws_region
  )
  infoClient = instanceInfoRds.client

Our AWS access credentials were already configured from the original version of the plugin, and the last line in this snippet initializes an RDS client object so we can start making calls to the AWS API.

As our plugin already retrieved a list of all the RDS instances in our account, we could loop through the list of instances and make a simple call for each instance to get all the attributes of that instance:

  resp = infoClient.describe_db_instances({
    :db_instance_identifier => instance_id
  })
For those not familiar with Ruby, we are simply using our new object to make a describe_db_instances call with the instance ID as its one parameter. Now in the resp object we have all the details of our RDS instance, including how much space it has allocated.

So now we could pull back our sought-after data:

  resp = infoClient.describe_db_instances({
    :db_instance_identifier => instance_id
  })
  resp_hash = resp.data

  # allocated storage is reported in gigabytes
  instanceStorage =
    resp_hash[:db_instances][0][:allocated_storage]

  # convert to bytes to match the free-space metric
  instanceByteStorage =
    ((( instanceStorage * 1024 ) * 1024) * 1024)

Because the AWS calls for the free space were coming to us in bytes, we wanted our total allocated storage value to be in bytes as well. We could then plug this value into our initial equation and presto, we have the amount of used disk space.

Hooray! We had our value, and it could be determined for any number of instances in our account. We just needed to post it to New Relic.

Because we were modifying an existing plugin, we followed its approach to building a collection of data points to send to New Relic. Deep within the bowels of our code, the final Ruby call to post the data looks like this:

    component, "Component/#{metric_name}[#{unit}]", value

Which in our case basically resolved to:

    "Database Instance ID", "Component/DiskSpaceUsed[percent]", 10

We were now receiving metrics in New Relic on how much space we were using as a percentage of the overall storage space. From here we could simply add it to our dashboards and alarms as required.

One Last Thing

Before we finished, we had one more idea. We had just made a call to the RDS API to determine the allocated storage. Were there any other pieces of information that could be useful to us? The short answer was yes: we could set up an alarm for when our RDS instance failed over to another availability zone.

Our RDS instances incorporate an AWS feature called Multi-AZ. At a high level, if there is a problem with the RDS instance, AWS can automatically fail over to another availability zone, keeping our application running. This is great for data reliability, but in our architectural model we prefer to keep the RDS instance and the EC2 instance in the same zone for optimal performance.

When we build our cloud stack, we know what zone we want the primary RDS instance to be in. Furthermore, we could easily extend the plugin to return a value indicating what zone the RDS instance was actually running in. By simply assigning the primary zone a value of 1 and other zones a value higher than 1, we could add a New Relic metric to trigger an alarm if our RDS instance was being hosted in a zone other than the primary.
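A minimal sketch of that mapping, using a made-up primary zone name (in practice the zone comes back from the same describe_db_instances call, and the primary zone is known from the stack definition):

```ruby
# Hypothetical primary zone for illustration.
PRIMARY_ZONE = "ap-southeast-2a"

# Returns 1 when the instance is running in its primary zone, 2 otherwise,
# so a New Relic alarm can simply trigger on any value greater than 1.
def zone_metric(current_zone)
  current_zone == PRIMARY_ZONE ? 1 : 2
end
```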

Adding this check required exactly the same technique we used to set up the Disk Usage metric, but in this case all we had to look for was a value greater than 1. We now had our very own RDS fail-over check, with only a couple of hours of effort.


We still have a little bit of polishing to do on our plugin but the hard work is done. Our modifications centred on the RDS component of the initial plugin but there is no reason that some of the other components (for example, EC2, ELB and DynamoDB) could not be modified in a similar fashion to enhance the monitoring of AWS stacks.

Whilst initially it might have been better if the AWS Plugin just worked for us, the flipside is that we wouldn’t have had the opportunity to use the custom plugin capabilities that New Relic provides, and we wouldn’t have developed an approach for monitoring RDS failover either. Now that we understand how to build our own plugin, we will have a greater degree of control of what we monitor in the future.
