[EUNE/TR/EUW] Reconnect Issues over last 12 hours

Hey folks,

Over the last 12 hours or so, we've had some frustrating issues within the TR/EUW/EUNE regions. Here's a summary of things:

In troubleshooting the events impacting the EUNE/TR/EUW regions, we incorrectly diagnosed the root cause, resulting in an extended, undetected impact to a small subset of players. The ultimate cause was the failure of a network link between two Riot datacenters, which had the knock-on effect of saturating a backup network link. The failed link was fully restored at 09:45 GMT this morning. Overall impact for most players was 1 hour, but a small group was affected for up to 10 hours.

* For EUW/EUNE players: The bulk of players were able to reconnect and play LoL after routing re-converged following the network link failover. Because of the way we optimize routing so players get the lowest latency possible, a small subset of players in these regions continued to have connectivity issues until the primary network link was restored.
* For TR players: For a subset of the TR Game Servers, this network link degradation continued to impact players for the duration of the issue.

We take service uptime and stability very seriously. Hopefully the above provides some context on the issues we've experienced recently. We already have Rioters assigned to look into the specific fails we experienced (tech, process, troubleshooting, etc.), and will be attacking these fails aggressively to prevent future issues.

For those wanting more specifics... (all times are approximate for simplicity)

23:00 GMT: Riot NOC noticed reconnect spikes in EUNE/TR/EUW regions. Loss Forgiven for Ranked games was enabled for these regions. In researching, we found messaging from one of our service partners about a problem they were having, and we concluded that this was the cause of our issue. Ultimately, this led us down an incorrect path, as the below events did not actually affect LoL.
* CloudFlare write-up of issue: https://blog.cloudflare.com/the-internet-is-a-cooperative-system-dns-outage-of-6-july-2015/
* DynDNS write-up: http://hub.dyn.com/h/i/104223007-update-managed-dns-issue-july-6-2015/87989

23:30 GMT: DNS issue resolved. NOC continues to monitor regions.

00:15 GMT: Loss Forgiven disabled and Ranked queues enabled in EUW and EUNE. TR still seeing reconnects, but they are decreasing. Loss Forgiven remains enabled for TR. Investigation continues.

* _At this point, we believed EUW/EUNE was fully restored. Yes, it was restored for the large majority of players, but as mentioned in the summary, there was a small subset of players still impacted._

00:40 GMT: TR reconnects still trending downward. Loss Forgiven disabled and Ranked queues enabled for TR. NOC continues to monitor regions.

03:15 GMT: TR Ranked queues and LeaverBuster disabled as TR reconnects persist. Investigation continues.

07:45 GMT: Network link between two Riot datacenters down. Backup connectivity is overburdened due to recently added cross-site Game Server capacity. Affected TR Game Servers disabled, which clears up reconnect issues while the network link is investigated. Ranked queues and LeaverBuster enabled for TR.

* Question: "Riot, why did it take so long to find the issue?"
* Answer: _A few things. First, we simply mis-diagnosed the root cause and (incorrectly) attributed the impact to the DNS issue above. Second, degraded performance is usually more difficult to detect than full outages. In this case, we had a subset of the TR Game Servers using the congested network link, and only a subset of *that* subset was randomly impacted at any given time. Third, the events happened during the lowest population times for these regions, making it more difficult to detect issues due to lower sample sizes. Lastly, something we'll be addressing quickly is how we can better monitor and detect these degradations sooner.
Frankly, even with a less-clear degraded situation, we should still be able to detect and respond more rapidly. We can do better, and we will._

09:45 GMT: Network link restored. NOC adds small batch of disabled TR game servers back into rotation, closely monitoring performance.

10:30 GMT: NOC confirms TR game servers performing as desired. Begins adding remaining servers into rotation.