The recent SDN & OpenFlow World Congress in Düsseldorf attracted a fascinating mix of attendees. On one side were long-time veterans of the telecom industry exploring the opportunities that virtualization is bringing to service provider networks. On the other side were IT and cloud experts working on the challenges of extending their infrastructures to support telecom services.

The topic bringing these two groups together, of course, is network functions virtualization (NFV). The promise of NFV is that a combination of virtualization and “cloudification” will enable service providers both to reduce their opex through improved network efficiency and to improve their top-line revenues through the agile delivery of new, value-added services. In order to successfully achieve this goal, IT teams and networking teams are going to have to work together in unprecedented ways. Each group approaches the challenges from a different perspective and with a different set of experiences.

One area that causes a lot of confusion and misunderstanding for folks with a background in IT and cloud infrastructure is the whole topic of “Carrier Grade” reliability for telecom services. More and more vendors are starting to use Carrier Grade terminology in connection with their products, but the requirements and challenges of Carrier Grade reliability are very different from what many people have dealt with before, and the telecom industry of course brings its own alphabet soup of confusing acronyms and terminology.

In this post, I’ll outline some of the myths about Carrier Grade often encountered when demonstrating NFV solutions to conference attendees whose main focus until now has been on enterprise-type applications.

Myth 1: Carrier Grade reliability has no direct impact on service provider revenues

In 2014, Heavy Reading published a detailed analysis titled “Mobile Network Outages & Service Degradations” that discussed the business impact of network outages. The report calculated that, during the 12 months ending in October 2013, service providers worldwide lost approximately $15 billion to such outages, representing between 1% and 5% of their total revenues. All major service providers were affected.

There are several sources of this lost revenue. First, there’s the increased rate of subscriber churn – dissatisfied customers taking their business elsewhere. Second, there are the operational expenses incurred to fix the problems. Third, service providers lose the ability to capture revenue from a billable service if it’s unavailable. Fourth, future revenues are impacted due to damage to brand reputation. Fifth, refunds must be paid to enterprise customers with service-level agreements (SLAs) that guarantee a certain level of uptime. And finally, there are inevitably legal costs relating to SLA issues.

It’s important to note that this analysis covers a 12-month period ending in 2013, when service provider infrastructure was based entirely on physical equipment, typically with high reliability proven over many years of deployment, and before any adoption of network virtualization.

NFV has the potential to make this situation much worse: services and applications will now be virtualized; they will be new and unproven; virtual machines (VMs) will be dynamically reallocated across servers, racks, and even data centers; traffic flows will be more complex and harder to debug; and solutions will inevitably be multi-vendor rather than coming from a single supplier.

As they progressively adopt NFV, it’s a business imperative for service providers to maintain Carrier Grade reliability for their critical services and high-value customers. Otherwise their overall uptime will decrease, further eroding their revenues and negating one of the key reasons (top-line growth) for moving to NFV in the first place.

Myth 2: Carrier Grade reliability is a stand-alone “feature” that you can add to your infrastructure

It’s extremely difficult to develop network infrastructure that delivers Carrier Grade reliability. Multiple, complex technologies are needed in order to guarantee six-nines (99.9999%) reliability at the infrastructure level so that services can achieve five-nines uptime.

Looking first at what it takes to guarantee network availability for virtualized applications, an optimized hypervisor is required that minimizes the duration of outages during the live migration of VMs. The standard implementation of KVM, for example, doesn’t provide the response time that’s required to minimize downtime during orchestration operations for power management, software upgrades, or reliable spare reconfiguration. In order to respond to failures of physical or virtual elements within the platform, the management software must be able to detect failed controllers, hosts, or VMs and very quickly launch self-healing actions, so that service impact is minimized or eliminated when failovers occur. The system must automatically act to recover failed components and to restore sparing capability if that has been degraded. To do this, the platform must provide a full range of Carrier Grade availability APIs (shutdown notification, VM monitoring, live migration deferral, etc.), compatible with the needs of the OSS, orchestrator, and VNFs. The software design must ensure there is no single point of failure that can bring down a network component, nor any “silent” VM failures that can go undetected.
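To make the failure-detection and self-healing requirement concrete, here is a minimal sketch of that kind of monitoring loop. The node names, heartbeat timers, and recovery actions are illustrative assumptions, not any particular platform’s API; a real system would run this continuously and drive recovery through the virtualized infrastructure manager.

```python
# Minimal sketch of a platform-level fault-detection / self-healing loop.
# Node names, timing budgets, and recovery actions are illustrative
# assumptions, not any vendor's or OpenStack's actual API.
import time
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 0.5   # how often hosts/VMs report liveness (assumed)
FAILURE_THRESHOLD_S = 1.5    # silence tolerated before declaring failure (assumed)

@dataclass
class Node:
    name: str
    kind: str                 # "controller", "host", or "vm"
    last_heartbeat: float = field(default_factory=time.monotonic)

    def is_failed(self, now: float) -> bool:
        return (now - self.last_heartbeat) > FAILURE_THRESHOLD_S

def self_heal(node: Node) -> None:
    """Launch a recovery action appropriate to the failed element."""
    if node.kind == "vm":
        print(f"[heal] restarting {node.name} on a healthy host")
    elif node.kind == "host":
        print(f"[heal] fencing {node.name}; evacuating or respawning its VMs")
    else:
        print(f"[heal] failing over to the standby controller for {node.name}")

def monitor(nodes: list[Node]) -> None:
    """One pass of the detection loop; a real platform runs this continuously."""
    now = time.monotonic()
    for node in nodes:
        if node.is_failed(now):
            self_heal(node)   # restore service first, then rebuild spare capacity

if __name__ == "__main__":
    fleet = [Node("controller-0", "controller"), Node("compute-3", "host"),
             Node("vnf-firewall-1", "vm")]
    fleet[1].last_heartbeat -= 5.0    # simulate a host that stopped responding
    monitor(fleet)
```

The design point this is meant to show is that detection and recovery are driven by the platform, which can see controllers, hosts, and VMs together, rather than by the applications themselves.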

Second, network security requirements present major challenges. Carrier Grade security cannot be implemented as a collection of bolt-on enhancements to enterprise-class software – rather it must be designed-in from the start as a set of coordinated, fully embedded features. These features include: full protection for the program store and hypervisor; AAA (authentication, authorization, accounting) security for the configuration and control point; rate limiting, overload, and denial-of-service (DoS) protection to secure critical network and inter-VM connectivity; encryption and localization of tenant data; secure, isolated VM networks; secure password management; and the prevention of OpenStack component spoofing.
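To pick out just one item from that list, the sketch below shows token-bucket rate limiting applied to control-plane requests, with one bucket per tenant so that a single noisy or malicious source cannot starve the configuration and control point. The rates, burst sizes, and tenant names are assumptions for illustration only.

```python
# Sketch of rate limiting / DoS protection for critical control-plane and
# inter-VM connectivity. Bucket sizes and rates are illustrative assumptions.
import time

class TokenBucket:
    """Classic token-bucket limiter: allow short bursts, cap the sustained rate."""
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False          # drop or defer the request instead of overloading

# One bucket per tenant/source keeps a single misbehaving tenant from
# starving the configuration and control point.
limiters = {"tenant-a": TokenBucket(rate_per_s=100, burst=200)}

def handle_control_message(tenant: str) -> str:
    bucket = limiters.setdefault(tenant, TokenBucket(rate_per_s=100, burst=200))
    return "processed" if bucket.allow() else "rejected (rate limited)"

if __name__ == "__main__":
    print([handle_control_message("tenant-a") for _ in range(3)])
```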

Third, a Carrier Grade network has stringent performance requirements, in terms of both throughput and latency. The host virtual switch (vSwitch) must deliver high bandwidth to the guest VMs over secure tunnels. At the same time, the processor resources used by the vSwitch must be minimized, because service providers derive revenue from resources used to run services and applications, not those consumed by switching. The data plane processing functions running in the VMs must be accelerated to maximize the revenue-generating payload per watt. In terms of latency constraints, the platform must ensure a deterministic interrupt latency of 10µs or less, in order for virtualization to be feasible for the most demanding CPE and access functions, such as C-RAN. Live migration of VMs must occur with an outage time less than 200ms, using a “share nothing” model in which all a subscriber’s data and state are transferred as part of the migration. The share nothing model, used in preference to the shared storage model in enterprise software, ensures that legacy applications are fully supported without needing to be rewritten for deployment in NFV.
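As an illustration of the share-nothing idea, here is a rough sketch in which the subscriber state travels with the VM and the final switchover is checked against the 200ms budget. The state contents, phases, and timings are purely illustrative; this is not a description of any real hypervisor’s migration path.

```python
# Sketch of the "share nothing" migration model: subscriber state travels with
# the VM rather than living on shared storage, and the final switchover is
# timed against the ~200 ms outage budget. All details are illustrative.
import time
from copy import deepcopy

OUTAGE_BUDGET_S = 0.200   # maximum acceptable service outage during migration

def migrate(source_vm: dict, dest_host: str) -> dict:
    # Phase 1: pre-copy memory and subscriber state while the VM keeps serving.
    staged = deepcopy(source_vm["state"])

    # Phase 2: brief stop-and-copy of whatever changed, then switch over.
    t0 = time.monotonic()
    staged.update(source_vm["state"])          # copy the final dirty state
    new_vm = {"host": dest_host, "state": staged}
    outage = time.monotonic() - t0

    if outage > OUTAGE_BUDGET_S:
        raise RuntimeError(f"switchover took {outage*1000:.1f} ms, over budget")
    return new_vm

if __name__ == "__main__":
    vnf = {"host": "compute-1",
           "state": {"subscriber-42": {"session": "active", "bytes": 10_240}}}
    moved = migrate(vnf, "compute-7")
    print(moved["host"], moved["state"]["subscriber-42"])
```

Because everything the service needs is carried in the migration itself, an application written against local state does not have to be rewritten around a shared-storage model.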

Finally, key capabilities must be provided for network management. To eliminate the need for planned maintenance downtime windows, the system must support hitless software upgrades and hitless patches. The backup and recovery system must be fully integrated with the platform software. And support must be implemented for “Northbound” APIs that interface the infrastructure platform to the OSS/BSS and NFV orchestrator, including SNMP, NETCONF, XML, REST APIs, OpenStack plug-ins, and ACPI.
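As an example of what a Northbound integration might look like, here is a hedged sketch of an OSS polling the platform’s active alarms over REST. The endpoint URL, token header, and JSON shape are hypothetical, not any real platform’s schema; SNMP or NETCONF equivalents would carry the same information.

```python
# Sketch of a "Northbound" integration: an OSS polling platform alarms over a
# REST API. The URL, token header, and JSON fields are hypothetical.
import requests

PLATFORM_API = "https://nfvi-platform.example.net/api/v1"   # hypothetical endpoint

def fetch_active_alarms(token: str) -> list[dict]:
    resp = requests.get(
        f"{PLATFORM_API}/alarms",
        headers={"X-Auth-Token": token},      # AAA-issued token (assumed scheme)
        params={"state": "active"},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json().get("alarms", [])

if __name__ == "__main__":
    for alarm in fetch_active_alarms(token="example-token"):
        print(alarm.get("severity"), alarm.get("entity"), alarm.get("reason"))
```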

You can’t achieve these challenging requirements by starting from enterprise-class software that was originally developed for IT applications. This type of software usually achieves three-nines (99.9%) reliability, equivalent to downtime of almost nine hours per year.
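The arithmetic behind those figures is easy to check: annual downtime is simply (1 − availability) × one year.

```python
# Quick check of the downtime figures behind the "nines".
SECONDS_PER_YEAR = 365.25 * 24 * 3600

for label, availability in [("three-nines", 0.999),
                            ("five-nines", 0.99999),
                            ("six-nines", 0.999999)]:
    downtime_s = (1 - availability) * SECONDS_PER_YEAR
    print(f"{label:12s} {availability:.6f} -> {downtime_s/3600:7.2f} h "
          f"({downtime_s/60:8.1f} min, {downtime_s:9.1f} s) per year")
# three-nines ~ 8.8 hours/year; five-nines ~ 5.3 minutes/year;
# six-nines ~ 32 seconds/year.
```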

Myth 3: Carrier Grade reliability can be implemented in the network applications themselves

There’s been a lot of industry discussion recently about Application-Level High Availability (HA). This concept places the burden of ensuring service-level reliability on the applications themselves, which in an NFV implementation are the VNFs. If it’s achievable, it’s an attractive idea because it means that the underlying NFV infrastructure (NFVI) could be based on a simple open-source or enterprise-grade platform.

Even though such platforms, designed for IT applications, typically only achieve three-nines reliability, that would be acceptable if the applications themselves could recover from any potential platform failures, power disruptions, network attacks, link failures, etc., while also maintaining their operation during server maintenance events.

Unfortunately, Application-Level HA by itself doesn’t achieve these goals. No matter which of the standard HA configurations you choose (Active/Standby, Active/Active, N-Way Active with load balancing), it won’t be sufficient to ensure Carrier Grade reliability at the platform level.

In order to ensure five-nines availability for services delivered in an NFV implementation, you need a system that guarantees six-nines uptime at the platform level, so that the platform can detect and recover from failures quickly enough to maintain operation of the services. This implies that the platform needs to deal with a wide range of disruptive events that cannot be addressed by the applications because they don’t have the right level of system awareness or platform management capability.
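A back-of-the-envelope comparison makes the point. In the sketch below, an active/standby VNF pair relies on its own heartbeats to discover that its peer has failed, while the platform is assumed to detect and recover from the same fault at the infrastructure level. All of the timer values are illustrative assumptions, not measurements of any product.

```python
# Why application-level HA alone falls short: compare the outage consumed per
# app-level failover with the five-nines service budget, and with an assumed
# sub-second platform-level recovery. Timer values are illustrative.
HEARTBEAT_PERIOD_S = 1.0       # app-level: standby pings the active VNF
MISSED_BEFORE_FAILOVER = 3     # tolerate transient loss before taking over
APP_TAKEOVER_S = 2.0           # promote the standby, reroute traffic

PLATFORM_RECOVERY_S = 0.3      # assumed platform-level detect + recover time

app_outage = HEARTBEAT_PERIOD_S * MISSED_BEFORE_FAILOVER + APP_TAKEOVER_S
five_nines_budget = (1 - 0.99999) * 365.25 * 24 * 3600    # ~316 s/year

print(f"five-nines service budget:    {five_nines_budget:6.1f} s/year")
print(f"app-level HA, per failover:   {app_outage:6.1f} s "
      f"(~{five_nines_budget / app_outage:.0f} failovers/year allowed)")
print(f"platform-level, per failover: {PLATFORM_RECOVERY_S:6.1f} s "
      f"(~{five_nines_budget / PLATFORM_RECOVERY_S:.0f} failovers/year allowed)")
# And some events -- host hardware faults, hypervisor hangs, switch failures --
# are invisible to the application entirely; only the platform can act on them.
```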

Myth 4: Carrier Grade reliability is something you get from the OPNFV project

Formally launched in September 2014, the Open Platform for NFV (OPNFV) project is developing an open-source reference platform intended to accelerate the introduction of NFV solutions and services. OPNFV operates under the Linux Foundation, and the primary goal of the project is to implement the ETSI specification for NFV.

Several service providers have been quoted publicly as confirming that they see the OPNFV reference platform as a way to accelerate the transition from the standards established by ETSI to actual NFV deployments. Of course they recognize that OPNFV code can’t be directly deployed into live networks, anticipating that software companies will use OPNFV as the baseline for commercial solutions with full SLA support.

OPNFV’s initial focus is NFV infrastructure (NFVI) and virtualized infrastructure management (VIM) software, implemented by integrating components from upstream projects such as OpenDaylight, OpenStack, Ceph Storage, KVM, Open vSwitch, and Linux. Along with application programming interfaces (APIs) to other NFV elements, these NFVI and VIM components form the basic infrastructure required for hosting VNFs and interfacing to Management and Network Orchestration (MANO).

The first OPNFV release “Arno” became available in June 2015. Arno is a developer-focused release that includes the NFVI and VIM components. The combination offers the ability to deploy and connect VNFs in a cloud architecture based on OpenStack and OpenDaylight. The next release “Brahmaputra” is planned as the first “lab-ready” release, incorporating numerous enhancements in areas such as installation, installable artifacts, continuous integration, improved documentation, and sample test scenarios.

Neither Arno nor Brahmaputra, however, incorporates any features that contribute to delivering Carrier Grade reliability in the NFVI platform. This is an example of an area where companies with proven experience in delivering six-nines infrastructure will continue to add critical value.