Following the previous week’s major outage at Amazon, this past week Verizon suffered an outage on its new LTE network. Starting on Tuesday, April 26th, customers with HTC Thunderbolt smartphones, USB sticks and portable hotspots found that they could not register with the network anywhere in the USA. Verizon acknowledged the outage on Wednesday morning. By Wednesday afternoon, Verizon reported that it had found the root cause and was planning to restore service market by market. On Thursday morning, Verizon announced the restoration of service with a tweet. Throughout the outage, customers expressed frustration at not being able to connect to the EV-DO network either. This was primarily due to faulty modem settings in the Thunderbolt and some of the modems. Considering that there are only about 500-600 K customers with LTE devices, the overall impact of the outage was fairly small for Verizon.
Last Thursday, Simon Leopold, an analyst at the private investment firm Morgan Keegan, released a research note pointing to the NSN Home Subscriber Server (HSS) as the root cause of the problem. The network-wide nature of the problem and the long outage period lend credence to the claim. However, as of Monday, May 2nd, Verizon has not confirmed the root cause.
Verizon revealed their IMS core vendor choices at the IMS World Forum in Barcelona, Spain in mid-April. These are:
- NSN Home Subscriber Server (HSS)
- Tekelec/Camiant Policy and Charging Rules Function (PCRF)
- Acme Packet Session Border Controller (SBC)
- Alcatel-Lucent Call Session Control Function (CSCF)
The context of Verizon’s presentation at the IMS World Forum was Voice over LTE (VoLTE) implementation planning. However, the HSS and PCRF are already in use for LTE customers, providing subscriber registration/authentication (HSS) and management of service/traffic handling policy (PCRF).
The following diagram from an Alcatel-Lucent LTE poster describes the interworking between the CDMA2000 EV-DO network that Verizon operates and the newly deployed LTE network. Without going into the details of the 3GPP Release 8 standards, we can point out that the Packet Data Network Gateway (PDN-GW) is the anchor point for mobility between the EV-DO and LTE networks. Rather than relying on client-based Mobile IP, the Release 8 standards rely on Proxy Mobile IP, which is implemented in both the Serving Gateway (S-GW) and the access network’s Packet Data Serving Node (PDSN). Verizon has a dual-stack implementation that provides both IPv6 and IPv4 addresses when the device is connected to LTE. On EV-DO, the device gets only an IPv4 address. Therefore, true network mobility is currently supported for IPv4 only.
Based on the symptoms of the outage, it appears likely that Verizon suffered an HSS outage. At the IMS World Forum, Verizon reported that it has built its IMS core across two geographically separate data centers. (As you’d expect from a telco with a long pedigree of reliability standardization, unlike those who suffered helplessly through the Amazon outage the previous week.) However, it appears that even a geo-redundant network wasn’t sufficient to prevent the outage. We believe Verizon was operating its HSS in an active-active fashion that required very tight synchronization between the two database instances. Such a system may suffer from two categories of service impact:
- A network failure or a process failure that blocks the synchronization between the two active databases.
- A network, software or hardware failure that renders one of the active systems unable to process incoming requests from network elements.
In the first scenario, the system rapidly moves into a state where the (geographical) high availability capability is lost. However, this scenario may not be catastrophic if the operator can manage/regulate the incoming load (until the next maintenance window) and then execute an emergency replication between different databases.
In the second scenario, the system should have been designed so that the HSS in one geographical location can handle the entire system-wide load. Unfortunately, this is easier said than done, since transactional systems like the HSS may see very high transient loads compared to steady-state traffic. To illustrate, imagine half the LTE devices in the Verizon network attempting to register with the alternate HSS within a few minutes. All of a sudden, the offered load may reach thousands of requests per second. It is quite possible that the HSS suffered such a transient problem, which also affected LTE devices that had been operating normally. In such a scenario, the network can quickly move into a state where every device constantly tries to register while the network cannot handle the total offered load at once. Verizon’s explanation that it would restore service market by market makes us think that this is a plausible scenario for last week’s outage.
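To put rough numbers on the failover scenario, here is a back-of-envelope sketch. The device count is the midpoint of the 500-600 K estimate above. The failover windows and the 10-second retry interval are purely illustrative assumptions, not figures from Verizon or NSN, and the function names are made up for this example.

```python
def offered_rate(devices: int, window_s: float) -> float:
    """Average attempts per second if every device tries once within window_s."""
    return devices / window_s


def retry_rate(pending: int, retry_interval_s: float) -> float:
    """Attempts per second if every unregistered device retries each interval."""
    return pending / retry_interval_s


if __name__ == "__main__":
    lte_devices = 550_000            # midpoint of the 500-600 K estimate
    failing_over = lte_devices // 2  # half the devices move to the other HSS

    for window_s in (60, 180, 300):  # assumed failover windows, in seconds
        rate = offered_rate(failing_over, window_s)
        print(f"{window_s:>3} s window: {rate:,.0f} attempts/s")

    # Assumed behavior: devices that fail to register try again every 10 s.
    print(f"retry storm: {retry_rate(failing_over, 10):,.0f} attempts/s")
```

Under these assumptions, the first wave averages somewhere between roughly 900 and 4,600 attempts per second. If those attempts fail and the devices keep retrying every 10 seconds, the offered load climbs to 27,500 attempts per second, which is how a transient overload can become self-sustaining.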
One way to reduce the likelihood of such outages would be to make the HSS more distributed. This has to be balanced against the desire to keep provisioning manageable. Nevertheless, it might be possible to develop a network design in which the HSS database, as well as the components serving signaling requests from network elements such as the Mobility Management Entity (MME), are further distributed. Another constraint is propagation delay, which ultimately affects the latency of the synchronization traffic.
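As a minimal sketch of what "more distributed" could mean, the example below splits subscribers across several HSS partitions using a stable hash of the IMSI. This is only an illustration of the idea, not how NSN or any other vendor partitions its HSS. The function names, the partition count and the IMSIs are all hypothetical.

```python
import hashlib
from collections import Counter


def hss_partition(imsi: str, partitions: int) -> int:
    """Map a subscriber identity (IMSI) to one of `partitions` HSS instances.

    A stable hash lets every querying node compute the same answer
    without consulting a central lookup table.
    """
    digest = hashlib.sha256(imsi.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def blast_radius(partitions: int) -> float:
    """Fraction of subscribers hit if one partition fails (even spread assumed)."""
    return 1 / partitions


if __name__ == "__main__":
    # Made-up 15-digit identifiers, used only to show the spread.
    imsis = [f"00101{n:010d}" for n in range(10_000)]
    spread = Counter(hss_partition(imsi, 8) for imsi in imsis)
    for partition in sorted(spread):
        print(f"partition {partition}: {spread[partition]} subscribers")
    print(f"one failed partition affects about {blast_radius(8):.1%}")
```

The trade-off mentioned above shows up quickly. Every extra partition is another instance to provision and keep synchronized, and a plain modulo mapping like this one moves most subscribers whenever the partition count changes, so a real design would need something more careful.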
Another strategy would be to use tight operational procedures to shut down the network when similar instabilities induced by high transient load occur. This would certainly not eliminate outages, but it would help reduce their impact and duration. We suspect Verizon wasn’t expecting to see a 24-hour impact on its network.
Moving to all-IP networking with LTE is a revolutionary change that impacts the entire network, including radio, transport and core. Even though all-IP networking provides a flatter network architecture, there are still hierarchical nodes in the mobile network, such as the HSS, PCRF and charging platforms. Although they are not single points of failure, they have a much bigger impact on network service levels than the access network does. Therefore, they need to be designed with a different level of care and attention to detail. We believe this outage was a big learning experience for Verizon. It has presumably conducted its root cause analysis and will make the necessary corrections. The remaining question is whether it will follow in Amazon’s footsteps and refund its customers for the loss of service.
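A milder variant of shutting the network down is to cap the rate at which registration attempts are admitted, rejecting the excess early so that the HSS only sees a load it can serve. The sketch below uses a generic token bucket for this. It is a standard rate-limiting technique chosen for illustration, not something Verizon is known to use, and the rate and burst values are assumptions.

```python
class TokenBucket:
    """Admit at most `rate` requests per second on average, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admitted(bucket: TokenBucket, arrivals) -> int:
    """Count how many arrival timestamps (seconds, ascending) are admitted."""
    return sum(bucket.allow(t) for t in arrivals)


if __name__ == "__main__":
    # Assumed storm: 27,500 attempts spread evenly over one second,
    # against an assumed safe HSS rate of 5,000 registrations per second.
    arrivals = [i / 27_500 for i in range(27_500)]
    gate = TokenBucket(rate=5_000, burst=500)
    print(f"admitted {admitted(gate, arrivals)} of {len(arrivals)} attempts")
```

Raising the admitted rate step by step, or opening the gate one region at a time, is in the same spirit as a market-by-market restoration.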