Reinventing virtualization with the AWS Nitro System


Running a business at the scale of Amazon, we often have to solve problems that no other company has faced before. The disadvantage of this is that there is no “how to” guide for us, and a lot is unknown. However, the advantage is that when we solve a new problem, it’s an opportunity to reinvent our services and create new benefits for our customers. Indeed, we have created some of our most innovative and successful ideas when we have entered uncharted territory.

When you’re a customer-centric company, you often find yourself in the great unknown because customers will always want more and better. You will need to invent on their behalf. A great example of this approach to innovation and problem solving is the creation of the AWS Nitro System (Nitro System), the underlying platform for our EC2 instances.

After years of optimizing traditional virtualization systems to the limit, we knew we had to make a dramatic change in the architecture if we were going to continue to increase performance and security for our customers. This realization forced us to rethink everything and became the spark for creating the Nitro System, the first infrastructure platform to offload virtualization functions to dedicated hardware and software. Now, with the Nitro System, we can offer the best price performance in the cloud, the most secure environment, and a faster pace of innovation.

Let’s look at the journey our team has been on in creating the Nitro System and what the result has meant for our customers.

In the beginning

A hypervisor is a piece of system software that provides virtual machines (VMs), on which users can run their OS and applications. The hypervisor provides isolation between VMs, which run independently of each other, and allows different VMs to run their own OS. An off-the-shelf hypervisor was never intended to be used in a multitenant cloud environment. But with deep tuning and customization, hypervisors can be adapted for true multitenancy, which simplifies machine provisioning and administration while increasing utilization and lowering costs for customers.

In the early days of EC2, we used the Xen hypervisor, which is purely software-based, to protect the physical hardware and system firmware; virtualize the CPU, storage, and networking; and provide a rich set of management capabilities. But with this architecture, as much as 30% of the resources in an instance were allocated to the hypervisor and operational management for network, storage, and monitoring.

Figure 1: EC2 Instance host architecture for the Xen Hypervisor

Thirty percent is significant, and this overhead wasn't providing direct value to our customers. It became clear to us that if we wanted to significantly improve performance, security, and agility for our customers, we had to migrate most of our hypervisor functionality to dedicated hardware. That’s when we began our journey of designing the Nitro System in 2012.
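
To make the 30% figure concrete, here is a minimal arithmetic sketch. The 64-vCPU host size is a made-up illustration value, and the "gain" is only an upper bound that assumes every bit of the overhead could be moved off the host, which is an assumption and not a measured AWS result.

```python
def usable_capacity(total_units: float, overhead_fraction: float) -> float:
    """Capacity left for customer workloads after management overhead."""
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return total_units * (1.0 - overhead_fraction)


def max_gain_from_offload(overhead_fraction: float) -> float:
    """Upper bound on the relative capacity gain if all overhead is offloaded."""
    return 1.0 / (1.0 - overhead_fraction) - 1.0


HOST_VCPUS = 64       # hypothetical host size, for illustration only
XEN_OVERHEAD = 0.30   # "as much as 30%" from the text

before = usable_capacity(HOST_VCPUS, XEN_OVERHEAD)
gain = max_gain_from_offload(XEN_OVERHEAD)
print(f"usable before offload: {before:.1f} of {HOST_VCPUS} vCPUs")
print(f"upper bound on capacity gain: {gain:.1%}")
```

The point of the second function is that reclaiming a 30% overhead is worth more than 30% to the customer: the usable pool grows relative to the 70% that was available before.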

The one-way door

At Amazon, we often talk about one-way and two-way door decisions. A two-way door decision is easily reversible, like testing out a new web page format. With this type of decision, you can move fast because, even if it takes a little time, you can reverse the decision. A one-way door decision is almost impossible to reverse, so you have to make it methodically, carefully, slowly, and with great deliberation and consultation.

Creating the Nitro System was a one-way door decision. We knew that we had outgrown the capabilities of traditional virtualization techniques. We had to innovate. But we did not make the decision quickly or lightly. The journey consisted of careful trial and error over the course of five years, with each step validating the direction we were taking.

The Nitro System consists of three main parts: the Nitro Cards, the Nitro Security Chip, and the Nitro Hypervisor. The Nitro Cards are a family of cards that offload and accelerate I/O for functions including Virtual Private Cloud (VPC), Elastic Block Store (EBS), and Instance Storage, thereby increasing overall system performance.

We launched our first Nitro offload card in the C3 instance type in 2013, offloading our network processes into hardware. Next came the C4 instance type in 2014, offloading EBS storage into hardware. For the C4 instance type, we worked for the first time with a company called Annapurna Labs. We were so impressed by the technology and the team there that we acquired Annapurna Labs in early 2015. By 2017, we had offloaded the last of the components, including the control plane and the remaining I/O, and we introduced a new hypervisor and the full Nitro System with the C5 instance type.

This was an incredible moment for us. Building hardware is challenging. Not only was it a major investment financially, but it was also a huge time commitment for so many employees. The hard work was worth it when the Nitro System launched.

Figure 2: 2017 Nitro System architecture

The Nitro architecture also enabled us to make the hypervisor layer optional and offer bare metal instances. Bare metal instances provide applications with direct access to the processor and memory resources of the underlying server.

This is important for workloads that require access to the hardware feature set, such as Intel® VT-x, and for applications that need to run in non-virtualized environments for licensing or support requirements. For example, I3 bare metal instances enable VMware to run their full Software-Defined Data Center (SDDC) stack, including the ESXi hypervisor, directly on AWS managed infrastructure.

So what does this all mean for our customers? Better performance and price, enhanced security, and a faster pace of innovation.

The customer impact: better performance and price

With the Nitro System, EC2 performs better across CPU, networking, and storage because we moved those functions onto dedicated Nitro Cards. Not having to hold back resources for management software means more savings that can be passed on to the customer.

The Nitro System also reduces hypervisor jitter. In the cloud, the ability to reduce jitter to the order of microseconds enables scenarios that otherwise couldn’t exist. For example, we have a customer that manages satellites, and it needed a real-time compute environment to support communication with its fleet. Specifically, the response to a network packet had to arrive within 150 µs for the workload to function. Traditional hypervisors simply cannot support this type of workload.
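
A deadline like this is decided by the worst case, not the average. The sketch below shows one way to check a set of response latencies against the 150 µs requirement. The function name, the report fields, and both sample lists are invented for illustration and are not measurements of any instance type.

```python
import statistics

DEADLINE_US = 150.0  # response deadline from the text, in microseconds


def deadline_report(latencies_us: list[float], deadline_us: float = DEADLINE_US) -> dict:
    """Summarize response latencies against a hard deadline."""
    if not latencies_us:
        raise ValueError("need at least one sample")
    misses = sum(1 for x in latencies_us if x > deadline_us)
    return {
        "worst_us": max(latencies_us),
        "median_us": statistics.median(latencies_us),
        "jitter_us": max(latencies_us) - min(latencies_us),  # spread, max minus min
        "miss_rate": misses / len(latencies_us),
        "meets_deadline": misses == 0,
    }


# Made-up samples for illustration only.
steady = [42.0, 45.0, 44.0, 47.0, 43.0]
spiky = [40.0, 41.0, 390.0, 42.0, 44.0]

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name, deadline_report(samples))
```

Both sample sets have a similar median, yet only the steady one meets the deadline. That is why a single occasional stall in the hypervisor matters so much for real-time workloads.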

Figure 3 below shows the jitter comparison for this customer when it used different instance types: C4 (pre-Nitro), C5, and I3.metal. As you can see, there is significantly lower jitter with C5 and I3.metal. You can also see that the performance impact from the hypervisor in a non-bare metal instance is light.

Figure 3: Nitro System hypervisor jitter improvements

In addition, the Nitro System improves storage latency. In Figure 4, you can see the storage latency improvement achieved by offloading the overhead from local disk storage. In fact, our storage optimized instances have seen performance increase by up to 4x. Even the R5d instance, which is not a storage optimized instance, offers better latency compared to fourth-generation instances.

Figure 4: Nitro System instance storage improvements

Finally, the Nitro System also provides enhanced networking performance. AWS is the first and only cloud to offer 100 Gbps enhanced Ethernet networking. This is beneficial for workloads that require higher throughput or are network bound, like HPC applications, and it is only possible because of the Nitro System.

Enhanced security

One of the biggest benefits of the Nitro System is enhanced security. First, we designed the Nitro System to operate in the most hostile of networks we could imagine. This means not only encrypting all communication channels but also providing secure booting capabilities. Although our data center networks are highly secure, this design has enabled us to launch new products. For example, last year, we introduced AWS Outposts, which brings the AWS experience to an on-premises data center with the Nitro System.

Second, we have engineered the system with a hardware-based root of trust using the Nitro Security Chip, allowing us to cryptographically measure and validate the system continuously. This provides a significantly higher level of trust in what is running than can be achieved with traditional hardware. By offloading and simplifying the overall stack, we also minimize the Trusted Computing Base (TCB), increasing confidence in the overall system.
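
As a rough illustration of what "cryptographically measure and validate" can mean, here is a minimal sketch of a generic measured-boot hash chain, similar in spirit to how a TPM extends a measurement register. It is not a description of how the Nitro Security Chip is implemented. The function names and the component byte strings are hypothetical stand-ins.

```python
import hashlib
import hmac


def extend(measurement: bytes, component: bytes) -> bytes:
    """Fold one component into the running measurement (hash-chain style)."""
    return hashlib.sha256(measurement + hashlib.sha256(component).digest()).digest()


def measure(components: list[bytes]) -> bytes:
    """Measure an ordered list of boot components, starting from all zeros."""
    measurement = bytes(32)
    for component in components:
        measurement = extend(measurement, component)
    return measurement


def validate(components: list[bytes], expected: bytes) -> bool:
    """Compare against a known-good measurement in constant time."""
    return hmac.compare_digest(measure(components), expected)


# Hypothetical images standing in for real boot components.
known_good = [b"firmware-v1", b"bootloader-v1", b"hypervisor-v1"]
expected = measure(known_good)

tampered = [b"firmware-v1", b"bootloader-EVIL", b"hypervisor-v1"]
print(validate(known_good, expected))
print(validate(tampered, expected))
```

Because each step hashes the previous measurement together with the next component, changing, removing, or reordering any component produces a different final value. In a real system the expected value and the comparison live in hardware that the measured software cannot modify, which is what makes it a root of trust.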

Third, we have designed the Nitro System to have very limited operator accessibility. In a typical off-the-shelf hypervisor, an administrator has full access to the system and can modify any component. In contrast, with the Nitro System, the only interface for operators is a restricted API, making it impossible to access customer data or mutate the system in unapproved ways. There is no equivalent of a “root” user or SSH, and as a result, the Nitro System provides a level of confidence that can’t be obtained by simply locking down a traditional hypervisor.

Fourth, traditional virtualization and that of other cloud providers use general purpose servers. By their very nature, these include extra and unnecessary components and capabilities. This increases the surface area for security vulnerabilities. By contrast, the Nitro System is a huge step forward in using purpose-built hardware and servers designed specifically to run a hypervisor—nothing more. Not only does this reduce the risk of security vulnerabilities, but it also gives us the ability to offload specific functions to dedicated hardware and software, which further minimizes the attack surface of the hypervisor.

Finally, with the Nitro System, we can apply formal verification to prove that the Nitro System works the way it is intended to.

Faster pace of innovation

With the Nitro System, we were able to break the architecture of EC2 into smaller blocks by offloading the virtualization functions onto dedicated hardware. These blocks can be assembled in many different ways, giving us the flexibility to design and rapidly deliver EC2 instances with an ever-broadening selection of compute, storage, memory, and networking options. As you can see from the “Nitro enabled innovation” chart below, we have launched nearly 4x the number of instance types since launching the Nitro System in 2017. As a result, our customers have a broader tool set to choose from as they optimize price and performance. The faster we innovate, the faster our customers can innovate.

Figure 5: Nitro System enabled innovation

Continuous innovation

Whether you choose to look at something as a problem or an opportunity can have a large effect on how you deal with it. We have chosen to look at the limitations of the traditional hypervisor as an opportunity to create a completely new architecture.

Now, with the Nitro System, our customers enjoy better performance, enhanced security, and a broader set of instance types to choose from. And we’re not done yet. When you’re a customer-focused company, your products will never be finished because customers always want more, and they always want better. And I look forward to giving this to them every time.

You can learn more about the AWS Nitro System on our web site.