This is the story of an investigation that I was recently involved in at a client. It took over a week to resolve, involved networking acronyms that seemed to increase in length each day (MTU…ICMP…PMTUD?!), and wound up with us learning all about the rather scary-sounding phenomenon of black-hole connections.
To set the scene, the app itself was a reasonably straightforward one: a Spring web service running in Tomcat. It sent and received traffic from an external SOAP web service via an IPSec site-to-site tunnel. On our side we were running OpenSwan, on their side, it was Cisco ASA.
The incident started like many do, on a Friday. Some of our developers had noticed timeouts in our web service when making calls to this external service.
Of course, as in most incidents, the first thing we asked ourselves, what’s changed? The last release was two weeks ago, no infrastructure changes, VPN hasn’t changed. At this point, we’re thinking this clearly has to be an issue on the third party side (particularly as this is an old single-threaded Oracle service which isn’t exactly known for its stability).
We immediately get on the phone to our friendly third party and let them know what we are seeing. To our relief, they advise that there is a known issue from a failed upgrade the previous day and that our issue is probably related to that. Good news from our point of view, we’ll just wait for them to fix the issue, and go home and enjoy our Friday evening.
But no, this is just the beginning of the story.
A few hours pass and our agents using the downstream application are getting increasingly frustrated with the inability to serve our customers due to this issue. People within the business are starting to take notice.
Escalation time! Back on the phone to our third party, time for some surprising news, we are told that the known issue we were advised of was related to a batch service that wasn’t even in the same realm as the real-time web service we were calling. Oh dear, hours of possible troubleshooting time lost and we are back to the start of our investigation. We are further advised that there have been no changes to the network or application layer at the third party. So now we are in the place no team wants to be: both sides have changed nothing, yet things are broken. Time to start digging.
A quick look at the logs doesn’t give us much to work with. We see many requests failing with the following:
javax.ws.rs.ProcessingException: java.net.SocketException: Unexpected end of file from server
The one thing it does tell us is that our web service is being cut off before receiving the response it’s expecting.
A quick look at our service call response times shows a concerning story, there’s a massive and consistent latency spike. Investigating the logs further reveals what we are starting to expect, problematic calls are simply timing out (as opposed to being actively terminated) without a response from our third-party service.
Looking at our error rate percentage shows a puzzling outcome: at the peak, we are only seeing ~15% of calls failing. So a reasonable amount of requests are successful.
Hmm, timeouts and sporadically failing requests. Surely this points to a flakey third-party service!? Let’s keep going.
After many hours of going around in circles, re-deploying services, rebooting servers and VPN’s we are no closer to an answer. In the early hours we call it a night and agree to resume early the following morning.
Starting bright and early the next morning we are all invited to an incident bridge, where we are advised of the significant impact this issue is having on our business partners and that we won’t be closing the bridge until the issue has been resolved. Wow, things just got serious.
As with any long-running incident, you get to the point where you realise your attempt for a quick win isn’t going to work. It’s time to dig deep and start isolating the parts of your service. We’re at that point.
First up, let’s take Java out of the equation. A couple of our engineers start extracting the payloads from problematic requests. Making these same requests from CURL, instead of our application, showed that the bad requests still fail and the good requests still return data. Our next thought is to try the same thing from our VPN server. Same result, bad requests timeout, good requests go through. Surely, now we have conclusive proof that it’s our third-party service that is failing? After all, we have taken the request right up until the point where it leaves our network.
After more incident calls, our third-party is adamant that nothing has changed on their side and that they aren’t even seeing the problematic requests arrive. At this point, nothing much is making sense. Neither side seems closer to a solution and we are starting to clutch at straws (after all, it’s entirely possible it’s a bug in the AWS network stack right? … right?).
So now what? We start looking for workarounds so we can buy ourselves some time while we try and find a root cause.
One engineer offers up a workaround, how about we spawn a child thread for the request and set a short time out on it. This is going to give our users an opportunity to quickly re-request instead of having to wait the full timeout period. Not a great experience, but better than where we are at at the moment.
It’s a relatively small code change and we are ready to test quickly. The code is deployed to our pre-production environment, which is displaying the same issue. Everyone on the phone bridge is eagerly waiting to confirm that this is a semi-viable workaround. “Wait a minute, the requests aren’t failing anymore, I can’t get any of the requests to fail with this code”, we quickly jump over to our production environment to confirm that requests are still failing in that environment, and indeed they are. So, suddenly we have a possible fix to the problem even though we still have no idea to its root cause.
Desperate times call for desperate measures though. At this point, we will take anything we can get. So we proceed to get the necessary approvals and push this “fix” into our production environment. There is a palpable sense of relief when we get the word from our users that this new code has resolved their issue for the time being. The decision is made at this point to stand down our incident team as some of use have been online for the better part of two and a half days. The fun part is still to come.
A couple of days later a few us of get together and workshop this issue for the entire day. We have a whole bunch of disparate evidence (tcpdump traces from every part of the chain, Java logs, squid logs and OpenSwan logs).
We start by digging through the packet captures from our application server. We can see a 3-way TCP handshake taking place, but then it’s essential radio silence, there is very little in the way of communication between our application and the third-party service. We jump over to the packet captures from our VPN server, this doesn’t really help us either.
After a few hours of documenting our architecture and brushing up on network protocol skills, we are not really any closer to a root cause. Later in the session one of our engineers stumbles on something interesting, we can’t reproduce the issue from our development environment, which is running the same code, talking to the same third-party service through the same VPN tunnel, huh.
After an hour or so of puzzling over this, the only difference we can find is that our development environment isn’t running our Java monitoring agent that talks back to the monitoring vendor. We quickly jump over to our pre-production environment and disable this agent and then re-run tests with the problematic code. 100% success, we can’t make it fail. We switch back and forth 5 or 6 times and repeat the test just to be sure that this is really happening, our monitoring agent is indeed causing some requests to fail when it is enabled.
Over the next week, one of our engineers continues to work on this problem and starts to pull out another possible clue. When the monitoring agent is enabled and a request fails we notice that it is adding a custom HTTP header that it is using to enable a feature that allows you to trace requests through multiple services. The catch here is not every request has this header, it’s only added to a sample of requests. Ah, now we are starting to see why only a certain percentage of requests were failing during our incident.
During further investigation, it’s found that we can replicate this behaviour by adding a certain number of characters to this header. We contact our third-party service to see if they have any arbitrary restriction on HTTP header size and we are told they definitely don’t. After some continued back and forth with their network engineers, we find something interesting, a network packet of length 1420 bytes will go through fine, a packet of length 1421 will fail. So it’s not the monitoring agent specifically, but the fact that its header pushes specific packets over a certain byte length!
At this point in time, I’m sure anyone reading this article who has worked in networking is starting to suspect that what we have on our hands is an issue related to MTU or packet fragmentation.
Before proceeding any further, if you aren’t familiar with how packet transmission takes place, I would recommend having a read through of this article.
I did what I think most engineers do when they at the edge of their knowledge about a specific subject: I started googling things.
The more I read, the more likely it seemed that what we were seeing was an issue related to Path MTU Discovery or PMTUD. In short, PMTUD is a technique designed to allow a network transaction to have the optimal packet size when packets are transiting the many possible hops between a source and destination. It does this to avoid IP fragmentation, and the inefficiencies that come with that, as much as possible.
I started to see multiple references to people turning off PMTUD when using OpenSwan (and variants like LibreSwan and StrongSwan) due to issues. The general recommendation was to run this command on your VPN server.
echo 1 >/proc/sys/net/ipv4/ip_no_pmtu_disc
This essentially turns off PMTUD on your server and sets a static MTU value for all packets. Luckily, I had a second VPN server that wasn’t in production that I could test on. I set this value and immediately the previous broken requests started working. The end is in sight! Setting our MTU size to a fixed value seemed like an unsatisfactory solution though, so I decided to do some more digging on why PMTUD wasn’t working.
This statement on the PMTUD article from Wikipedia caught my eye
Many network security devices block all ICMP messages for perceived security benefits, including the errors that are necessary for the proper operation of PMTUD. This can result in connections that complete the TCP three-way handshake correctly, but then hang when data is transferred. This state is referred to as a black hole connection.
Had we inadvertently stumbled into a black hole? So what is an ICMP black hole in relation to PMTUD?
When a source sends a packet out, it hops through many points on the network before reaching its destination. Along the way, each hop will check to see if the size of the packet is larger than it will accept; if it is, then it will attempt to send an ICMP fragmentation required packet back to the source. The source should then reduce the packet size and resend. A black hole occurs when the source blocks these ICMP packets. The ICMP packet is being blocked, so the source doesn’t know that the data packet has been dropped along the way. It also doesn’t get the relevant ICMP packet, so it is a state of limbo.
After learning this, I went back to our AWS account and checked whether we were indeed not allowing these ICMP packets to flow in. Sure enough, we did not open up this rule in our NACLs and Security Groups.
The fix? Well, it’s made pretty simple in AWS. You can open up the relevant ICMP packet in your NACLs as well as the Security Group attached to your instance.
I applied these settings, removed the previous Linux level PMTUD change and proceeded to test. Sure enough, everything was working correctly all of a sudden. Not only could we bump the packet size over the previously broken limit, but we could add MUCH bigger headers and still have communication flow correctly.
The only outstanding question we still have is what bumped us over the limit in terms of packet size. Our only guess is that our cloud monitoring solution enabled a feature that slightly increased the header size in the packet. If that is the case, then this was a problem waiting to happen.
I’m hoping that if there is anyone out there struggling through the same issue, they might be able to save some time if they come across this article.