Following a misconfiguration in one of its data centers, the US internet service provider (ISP) CenturyLink suffered a major technical outage on Sunday that spread across the internet, taking down many popular sites and services.
The error spread outward from CenturyLink’s network and ended up affecting other ISPs, which led to connectivity problems for many other companies including Amazon, Twitter, NameCheap, OpenDNS, Reddit, Discord, Hulu and Steam.
The web infrastructure and website security company Cloudflare was also severely impacted by CenturyLink’s outage. In a blog post, its CEO and co-founder Matthew Prince explained how the incident affected the internet as a whole, saying:
“Because this outage appeared to take all of the CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other internet provider until the issue was resolved. Globally, we saw a 3.5% drop in global traffic during the outage, nearly all of which was due to a nearly complete outage of CenturyLink’s ISP service across the United States.”
Incorrect Flowspec rule
Based on information from a CenturyLink status page, it appears the issue originated in the ISP’s CA3 data center in Mississauga, located in Canada’s Ontario province.
Because its own services were affected, Cloudflare followed the outage closely, and it believes the cause may have been an incorrect Flowspec rule that came at the end of a long list of BGP updates.
If this was the case, every router in CenturyLink/Level(3)’s network would have received the Flowspec rule and started blocking BGP, which would lead them to stop receiving the rule.
The devices would then start back up and work their way through all the BGP updates until they reached the incorrect Flowspec rule, at which point BGP would once again be dropped, creating an endless loop.
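The loop Cloudflare describes can be illustrated with a toy simulation. This is only a sketch of the suspected failure mode, not a model of real router software: the Router class, the BAD_FLOWSPEC marker and the simulate function are hypothetical names invented for illustration, and the cycle limit exists only so the example terminates.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the faulty rule; real Flowspec rules look nothing like this.
BAD_FLOWSPEC = "flowspec: drop BGP"


@dataclass
class Router:
    bgp_up: bool = True
    restarts: int = 0

    def replay(self, updates):
        """Apply updates in order; return how many were seen before BGP dropped."""
        self.bgp_up = True
        for count, update in enumerate(updates, start=1):
            if update == BAD_FLOWSPEC:
                # The rule blocks BGP itself, so the session that delivered it goes down.
                self.bgp_up = False
                return count
        return len(updates)


def simulate(updates, max_cycles=5):
    """Replay the update list until BGP stays up or max_cycles is reached."""
    router = Router()
    for _ in range(max_cycles):
        router.replay(updates)
        if router.bgp_up:
            break
        # BGP dropped, so the router starts over and receives the same list again.
        router.restarts += 1
    return router


if __name__ == "__main__":
    updates = ["route A", "route B", "route C", BAD_FLOWSPEC]
    stuck = simulate(updates)
    print(stuck.bgp_up, stuck.restarts)      # False 5
    healthy = simulate(updates[:-1])
    print(healthy.bgp_up, healthy.restarts)  # True 0
```

With the faulty rule at the end of the list, the simulated router never reaches a stable state and uses up every allowed cycle. Once the rule is removed, it converges on the first pass.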
BGP (Border Gateway Protocol) routes are messages that internet providers relay to one another to announce which groups of IP addresses are available on their networks. CenturyLink’s incorrect Flowspec rule also brought down some routers outside of its network, which began to announce incorrect BGP routes to other Tier 1 internet providers. This brought down other networks, causing the major internet outage experienced over the weekend.
Thankfully, CenturyLink was able to fix the issue by telling all other Tier 1 internet providers to ignore any traffic coming from its network. This type of action is usually a last resort as it results in all of the company’s customers losing internet connectivity.