[EUW/EUNE/TR/RU] Post-incident messaging - 17.11.2015

**TL;DR: We experienced a routing issue yesterday that left a subset of players unable to log in on EUW, TR and RU. We found and corrected the issue, but many players were disconnected from the game and platform in the process (all of them, in the case of EUW).**

Hi everyone,

Here’s what happened yesterday, and ultimately what led to the loss of connectivity to the EUW server, Turkey & Russia, as well as a percentage of the EUNE server.

Starting at about 8AM GMT+1 (17/11), we began receiving reports from players who were unable to log in on EUW, and we started gathering data to figure out whether the reports had anything in common. The engineering teams related to the incident joined the investigation, and we worked meticulously on eliminating potential causes. We also focused on reducing the problem space from "All the things are broken" to "system XYZ is the specific issue".

What was happening is that a small subset of players were unable to connect to the game, and their geographical locations were all over the place. There was no indication of any location-specific problem. All three servers were hit by this issue (EUNE aside, where only a small percentage was affected), but EUW, RU and TR are in different locations as far as where “authentication” to the servers happens. Basically, we had a random number of sources, and multiple destinations affected.

Eventually, the investigation led us to believe that the EUW issue was unrelated to the TR/RU one (due to the different locations mentioned above), so we went down that path and treated each as a separate issue. We covered transit networking as a potential root cause, as well as general datacenter networking, then moved on to individual lookups and to making sure that communication between the required pieces was working.
Every way we looked at the issue, we were unable to find a smoking gun, but we knew players were still facing login issues despite multiple attempts on our side to fine-tune and tweak anything we found that looked “odd”. One thing that kept coming back throughout the day was the CloudFlare errors players received, specifically 522 (or sad Amumu), which really just meant that the connection between CloudFlare and Riot had timed out. Yet our network traffic, packet captures and so on still looked OK.

At about 3PM GMT+1, we caught a break in the investigation: we found what was causing the issue for one player, and expanded from that. We traced the traffic between two infrastructure points and noticed it would disappear at one of them. We tried forcing the data to come and go through another route and confirmed that the player was able to connect. From there, we figured something in the routing engine of our hardware had to be acting oddly. We kept digging and ultimately found a subset of routes (and therefore the players' traffic using those routes) that would get stuck in a communication loop between two points on Riot’s European infrastructure.

When we tried to resolve the issue at about 3:30PM GMT+1, we unfortunately disrupted the experience for you all: EUW ended up booting everyone off the platform and games, while TR, RU and EUNE saw a smaller-scale impact. We quickly activated Loss Forgiven for everyone and put up a ticker to let you know we were aware of the issue.

As we resolved the issues from the day, we monitored players getting back on the platforms and playing normally, including the players who had been locked out previously! We considered the incident resolved at about 5PM GMT+1, once we confirmed everything was back to normal.
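For readers curious about what a "communication loop between two points" looks like, here is a minimal Python sketch. It is only an illustration under stated assumptions: the node names (`edge`, `core-a`, `core-b`, `login`), the `trace_route` helper and the next-hop tables are all hypothetical and do not describe Riot's actual infrastructure or tooling.

```python
def trace_route(next_hop, source, destination, ttl=16):
    """Follow next-hop entries from source towards destination.

    next_hop maps each node to the node it forwards to for this
    destination (a hypothetical, heavily simplified routing table).
    Returns (path, outcome), where outcome is one of
    "delivered", "loop", "no route" or "ttl expired".
    """
    path = [source]
    seen = {source}
    node = source
    for _ in range(ttl):
        if node == destination:
            return path, "delivered"
        node = next_hop.get(node)
        if node is None:
            return path, "no route"
        path.append(node)
        if node in seen:
            # The packet came back to a node it already visited.
            return path, "loop"
        seen.add(node)
    outcome = "delivered" if node == destination else "ttl expired"
    return path, outcome


# Hypothetical healthy table: traffic reaches the login servers.
healthy = {"edge": "core-a", "core-a": "login"}

# Hypothetical bad state: two points each prefer the other as next hop.
broken = {"edge": "core-a", "core-a": "core-b", "core-b": "core-a"}

print(trace_route(healthy, "edge", "login"))
print(trace_route(broken, "edge", "login"))
```

With the healthy table the trace ends at `login`. With the broken one it bounces between `core-a` and `core-b` and never arrives, which from the outside simply looks like a timeout.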
Essentially, what happened was that a routing table ended up in a bad state and would send a very small portion of traffic through what it considered a “preferred” route; but where it sent the data was not at all where it was supposed to go, so players with login issues would “timeout” and not receive a response from the Login servers.

What brought this on is unknown at this stage, and while it would be easy to assume that the maintenance that occurred earlier that day was the cause, it is important to note that the underlying cause had been present for much longer. The maintenance might have exacerbated the issue, sure, but it was there before, hiding in the shadows. The good news is that this might have resolved some obscure issues where players would sometimes have trouble logging in.

Thanks to everyone who made it this far, and good luck on the Rift!