An unexpected discovery: Automated reasoning often makes systems more efficient and easier to maintain

During a recent visit to the Defense Advanced Research Projects Agency (DARPA), I mentioned a trend that piqued their interest: Over the last 10 years of applying automated reasoning at Amazon Web Services (AWS), we’ve found that formally verified code is often more performant than the unverified code it replaces.

The reason is that the bug fixes we make during the process of formal verification often positively impact the code’s runtime. Automated reasoning also gives our builders confidence to explore additional optimizations that improve system performance even further. We’ve found that formally verified code is easier to update, modify, and operate, leading to fewer late-night log analysis and debugging sessions. In this post, I’ll share three examples that came up during my discussions with DARPA.

Automated reasoning: The basics

At AWS, we strive to build services that are simple and intuitive for our customers. Underneath that simplicity lie vast, complex distributed systems that process billions of requests every second. Verifying the correctness of these complex systems is a significant challenge. Our production services are in a constant state of evolution as we introduce new features, redesign components, enhance security, and optimize performance. Many of these changes are complex themselves, and must be made without impacting the security or resilience of AWS or our customers.

Design reviews, code audits, stress testing, and fault injection are all invaluable tools we use regularly, and always will. However, we’ve found that these techniques alone are often not enough to confirm correctness. Subtle bugs can still escape detection, particularly in large-scale, fault-tolerant architectures. And some issues might even be rooted in the original system design, rather than implementation flaws. As our services have grown in scale and complexity, we’ve had to supplement traditional testing approaches with more powerful techniques based on math and logic. This is where the branch of artificial intelligence (AI) called automated reasoning comes into play.

While traditional testing focuses on validating system behavior under specific scenarios, automated reasoning aims to use logic to verify system behavior under any possible scenario. In even a moderately complex system, it would take an intractably large amount of time to reproduce every combination of possible states and parameters that may occur. With automated reasoning, it’s possible to achieve the same effect quickly and efficiently by computing a logical proof of the correctness of the system.
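To make the contrast concrete, here is a minimal Python sketch (not AWS code). It assumes a hypothetical 8-bit `midpoint` routine with a classic overflow bug. A handful of hand-picked tests pass, while covering every input pair exposes failures. Real automated reasoning tools reach this kind of coverage symbolically rather than by enumeration; brute force is used here only because the input space is tiny.

```python
from itertools import product

BITS = 8
MASK = (1 << BITS) - 1  # simulate 8-bit unsigned arithmetic


def midpoint(lo: int, hi: int) -> int:
    """Hypothetical buggy midpoint: the sum wraps around at 8 bits."""
    return ((lo + hi) & MASK) // 2


def spec_holds(lo: int, hi: int) -> bool:
    """Specification: for lo <= hi the result must lie between lo and hi."""
    return lo <= midpoint(lo, hi) <= hi


# Scenario-based testing: a few plausible inputs, all of which pass.
sampled = [(0, 10), (3, 4), (100, 120), (0, 255)]
assert all(spec_holds(lo, hi) for lo, hi in sampled)

# Exhaustive check: every pair in the 8-bit input space.
counterexamples = [
    (lo, hi)
    for lo, hi in product(range(MASK + 1), repeat=2)
    if lo <= hi and not spec_holds(lo, hi)
]
print("first counterexample:", counterexamples[0])
```

The sampled tests give no hint of a problem, yet the exhaustive pass reports inputs where the sum wraps around and the result falls outside the requested range.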

Using automated reasoning requires our builders to have a different mindset. Instead of trying to think about all possible input scenarios and how they might go wrong, we define how the system should work and identify the conditions that must be met in order for it to behave correctly. Then we can verify that those conditions are true by using mathematical proof. In other words, we can verify that the system is correct.
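As a small illustration of this mindset, the sketch below models a hypothetical connection pool (not an AWS system) as a state machine, states the condition that must always hold, and checks it inductively: the condition holds in the initial state, and every transition from a state that satisfies it leads to a state that also satisfies it. Together, those two facts imply the condition holds on every run, however long. The names `CAPACITY`, `step`, and `invariant` are assumptions made for the example.

```python
CAPACITY = 4  # hypothetical pool size
ACTIONS = ("acquire", "release")


def step(in_use: int, action: str) -> int:
    """Transition function of the model: returns the next state."""
    if action == "acquire" and in_use < CAPACITY:
        return in_use + 1
    if action == "release" and in_use > 0:
        return in_use - 1
    return in_use  # action not enabled: state unchanged


def invariant(in_use: int) -> bool:
    """The condition that must hold in every reachable state."""
    return 0 <= in_use <= CAPACITY


def is_inductive() -> bool:
    # Base case: the initial state (empty pool) satisfies the invariant.
    if not invariant(0):
        return False
    # Inductive step: from every state satisfying the invariant,
    # every action leads to a state that also satisfies it.
    states = [s for s in range(-1, CAPACITY + 2) if invariant(s)]
    return all(invariant(step(s, a)) for s in states for a in ACTIONS)


print(is_inductive())
```

Removing the `in_use < CAPACITY` guard from `step` makes the inductive step fail at `in_use == CAPACITY`, which is the kind of boundary case a reviewer would otherwise have to think of by hand.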

Automated reasoning expresses a system’s specification and implementation in mathematical terms, then applies algorithmic approaches to verify that the mathematical representation of the system satisfies the specification. By encoding our systems as mathematical systems and reasoning about them using formal logic, automated reasoning allows us to efficiently and authoritatively answer critical questions about the systems’ future behavior. What can the system do? What will it do? What can it never do? Automated reasoning can help answer these questions for even the most complex, large-scale, and potentially unbounded systems, which are impossible to validate exhaustively through traditional testing alone.

Does automated reasoning allow us to achieve perfection? No, because it still depends on certain assumptions about the correct behavior of the components of a system and the relationship between the system and the model of its environment. For example, the model of a system might incorrectly assume that underlying components such as compilers and processors don’t have any bugs (although it is possible to formally verify those components as well). That said, automated reasoning allows us to achieve higher confidence in correctness than is possible by using traditional software development and testing methods.

Faster development

Automated reasoning is not just for mathematicians and scientists. Our Amazon Simple Storage Service (Amazon S3) engineers use automated reasoning every day to prevent bugs. Behind the simple interface of S3 is one of the world’s largest and most complex distributed systems, holding 400 trillion objects, exabytes of data, and regularly processing over 150 million requests per second. S3 is composed of many subsystems that are distributed systems in their own right, many consisting of tens of thousands of machines. New features are being added all the time, while S3 is under heavy use by our customers.

A key component of S3 is the S3 index subsystem, an object metadata store that enables fast data lookups. This component contains a very large, complex data structure and intricate, optimized algorithms. Because the algorithms are difficult for humans to get right at S3 scale, and because we can’t afford errors in S3 lookups, we could ship improvements only about once per quarter: each change required extreme care and extensive testing before we could make it with confidence.

S3 is a well-built and well-tested system built on 15 years of experience. However, there was a bug in the S3 index subsystem whose root cause we couldn’t determine for some time. The system automatically recovered from the exception, so the bug didn’t affect the system’s behavior. Still, we were not satisfied.

Why was this bug around so long? Distributed systems like S3 have a large number of components, each with its own corner cases, and several corner cases can occur at the same time. In the case of S3, which has over 300 microservices, the number of potential combinations of these corner cases is enormous. It’s not possible for developers to think through each of these corner cases, let alone all of their possible combinations, even when they have evidence the bug exists and ideas about its root cause.

This complexity drove us to look at how we could use automated reasoning to explore the possible states and errors that might be hidden in those states. By building a formal specification of the system, we were able to find the bug and prove the absence of further bugs of its type. Using automated reasoning also gave us the confidence to ship updates and improvements every one to two months rather than just three to four times a year.
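The idea of exploring every possible state can be sketched with a toy explicit-state model checker. The model below is hypothetical and far simpler than any S3 specification: two workers each read a shared counter and then write back the value plus one. A breadth-first search over all interleavings finds the one ordering in which an update is lost and returns the trace that leads to it. The post does not say which tools or modeling language the S3 team used; this is only an illustration of the technique.

```python
from collections import deque

# State: (counter, (pc0, tmp0), (pc1, tmp1)); pc is 0=read, 1=write, 2=done.
INITIAL = (0, (0, 0), (0, 0))


def successors(state):
    """Yield every state reachable by letting one worker take one step."""
    counter, *workers = state
    for i, (pc, tmp) in enumerate(workers):
        if pc == 0:    # read the shared counter into a local copy
            new_counter, new_worker = counter, (1, counter)
        elif pc == 1:  # write back the local copy plus one
            new_counter, new_worker = tmp + 1, (2, tmp)
        else:
            continue   # this worker has finished
        new_workers = list(workers)
        new_workers[i] = new_worker
        yield (new_counter, *new_workers)


def violates(state) -> bool:
    """Property: once both workers are done, the counter must equal 2."""
    counter, w0, w1 = state
    return w0[0] == 2 and w1[0] == 2 and counter != 2


def find_violation():
    """Breadth-first search over all interleavings; returns a trace or None."""
    parent = {INITIAL: None}
    queue = deque([INITIAL])
    while queue:
        state = queue.popleft()
        if violates(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


trace = find_violation()
for s in trace:
    print(s)
```

The printed trace is a step-by-step counterexample: both workers read the counter before either writes, so one increment is overwritten. If the search returned `None` instead, the property would hold in every reachable state of the model.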

Faster code

The correctness of the AWS Identity and Access Management (IAM) service is foundational to the security of our customers’ workloads. Across millions of customers, thousands of resource types, and hundreds of AWS services, every API call—every single request to AWS—is processed by the IAM authorization engine. That’s over 1.2 billion requests per second. This is some of the most security-critical and highly scaled software in AWS.

Before any change at AWS goes into production, we need an extremely high degree of confidence that the system remains secure and correct. Using automated reasoning, we can prove that our systems adhere to specific security properties, under an exhaustive number of circumstances. We call this provable security. Not only has automated reasoning enabled us to provide provable security assurance to our customers, it gives us the ability to deliver functionality, security, and optimization at scale.

Like S3, IAM has evolved over 15 years into a time-tested and trusted system. But we wanted to raise the bar further. We built a formal specification that captures the behavior of the existing IAM authorization engine, codified its policy evaluation principles into provable theorems, and used automated reasoning to build a new and more efficient implementation. Earlier this year, we deployed the new proved-correct authorization engine, and no one noticed. Automated reasoning allowed us to seamlessly replace one of the most critical pieces of AWS infrastructure with a proved-correct equivalent.

With the specification and proofs in place, we could safely and aggressively optimize the code with a high degree of confidence. At the massive scale of IAM, every microsecond of performance improvement translates into a better customer experience and better cost optimization for AWS. We optimized string matching, removed unnecessary memory allocation and redundant computations, strengthened security, and improved scalability. After every change, we re-ran our proofs to confirm that the system was still operating correctly.
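The workflow of optimizing against a specification and re-checking after every change can be sketched in miniature. The example below is not the IAM engine: it assumes a hypothetical glob-style matcher with `*` and `?` wildcards, a short recursive function that serves as the specification, and a faster iterative version. The check compares the two on every pattern and text up to a small length, which is a bounded stand-in for the proofs described above, not a proof itself.

```python
from itertools import product


def matches_spec(pattern: str, text: str) -> bool:
    """Reference semantics: short and obviously correct, but slow."""
    if not pattern:
        return not text
    if pattern[0] == "*":
        return matches_spec(pattern[1:], text) or (
            bool(text) and matches_spec(pattern, text[1:])
        )
    return (
        bool(text)
        and pattern[0] in ("?", text[0])
        and matches_spec(pattern[1:], text[1:])
    )


def matches_fast(pattern: str, text: str) -> bool:
    """Optimized version: one pass, backtracking only to the last '*'."""
    p = t = 0
    star, resume = -1, 0  # last '*' position and text index to retry from
    while t < len(text):
        if p < len(pattern) and pattern[p] == "*":
            star, resume = p, t
            p += 1
        elif p < len(pattern) and pattern[p] in ("?", text[t]):
            p += 1
            t += 1
        elif star != -1:
            resume += 1          # let the last '*' absorb one more character
            t, p = resume, star + 1
        else:
            return False
    while p < len(pattern) and pattern[p] == "*":
        p += 1                   # trailing stars match the empty string
    return p == len(pattern)


def all_strings(alphabet: str, max_len: int):
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def first_mismatch(max_pattern: int = 4, max_text: int = 5):
    """Return a (pattern, text) pair where the two disagree, or None."""
    for pattern in all_strings("ab*?", max_pattern):
        for text in all_strings("ab", max_text):
            if matches_spec(pattern, text) != matches_fast(pattern, text):
                return pattern, text
    return None


print(first_mismatch())
```

Any later change to `matches_fast` can be re-checked by running `first_mismatch()` again; a returned pair is a concrete input on which the optimization changed behavior.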

The optimized IAM authorization engine is now 50% faster than its predecessor. We simply would not have been able to make these types of impactful optimizations with such confidence if we didn’t use automated reasoning. For a deeper look at how we did this, see this AWS re:Inforce session.

Faster deployment (of faster code)

Most secure online transactions are protected by encryption. For example, the RSA encryption algorithm protects data by generating two keys: one to encrypt the data, and one to decrypt it. These keys enable secure data transmission as well as secure digital signatures. In the context of encryption, correctness and performance are both essential—a bug in an encryption algorithm can be disastrous.

As AWS customers move their workloads to AWS Graviton, the benefits of optimizing cryptography for the ARM instruction set increase. But optimizing encryption for better performance is complex, which makes it difficult to verify that modified encryption algorithms are behaving properly. Before we started to use automated reasoning, optimizations to cryptography libraries often required months-long reviews to achieve confidence for release into production.

Enter the power of automated reasoning: formal verification made RSA faster, and faster to deploy. We are seeing similar improvements when we apply automated reasoning to elliptic curve cryptography.
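As a rough illustration of what checking an optimized cryptographic routine against a reference means, the sketch below uses textbook RSA with tiny, insecure parameters. It compares straightforward decryption with a well-known speedup based on the Chinese Remainder Theorem (CRT) and checks that the two agree on every possible input. The post does not say which optimizations AWS verified, so the choice of CRT and all names here are assumptions, and exhaustive comparison is only feasible because the modulus is so small.

```python
# Textbook RSA with tiny, insecure parameters (illustration only).
P, Q = 61, 53
N = P * Q                            # public modulus
E = 17                               # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (needs Python 3.8+)


def encrypt(m: int) -> int:
    return pow(m, E, N)


def decrypt_reference(c: int) -> int:
    """Straightforward decryption: one exponentiation modulo N."""
    return pow(c, D, N)


# Precomputed values for the CRT variant.
DP, DQ = D % (P - 1), D % (Q - 1)
Q_INV = pow(Q, -1, P)


def decrypt_crt(c: int) -> int:
    """Optimized decryption: two smaller exponentiations, then recombine."""
    m1 = pow(c, DP, P)
    m2 = pow(c, DQ, Q)
    h = (Q_INV * (m1 - m2)) % P
    return m2 + h * Q


# Check every possible ciphertext and every possible message.
assert all(decrypt_crt(c) == decrypt_reference(c) for c in range(N))
assert all(decrypt_reference(encrypt(m)) == m for m in range(N))
print("optimized and reference decryption agree on all", N, "inputs")
```

With realistic key sizes the input space is far too large to enumerate, which is why a mathematical proof of equivalence is needed instead of a loop like this one.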

The formation of a virtuous cycle

Over the last decade, we’ve increasingly applied automated reasoning techniques within AWS to prove the correctness of our cloud infrastructure and services. We routinely use these methods not only to verify correctness, but also to enhance security and reliability and minimize design flaws. Automated reasoning can be used to create a precise, testable model of a system, which we can use to quickly verify that changes are safe—or learn they are unsafe without causing harm in production.

We can answer critical questions about our infrastructure to detect misconfigurations that might expose data. We can help stop subtle but serious bugs from reaching production that we would not have found with other techniques. We can make bold performance optimizations that we would not have dared attempt without model checking. Automated reasoning provides rigorous mathematical assurance that critical systems behave as expected.

AWS is the first and only cloud provider to use automated reasoning at this scale. As adoption of automated reasoning tools increases, it becomes easier for us to justify ever-larger investments in improving the usability and scalability of those tools. The easier the tools are to use and the more powerful they become, the more adoption we’ve observed. The more we’re able to prove correctness of our cloud infrastructure, the more compelling the cloud is to security-obsessed customers. And, as the examples in this post illustrate, not only are we able to increase security assurance, we are delivering higher-performing code to customers faster, translating into cost savings that we can eventually pass on to customers.

My prediction is that we’re at the beginning of an era in which critical properties like security, compliance, availability, durability, and safety can be proved automatically for large-scale cloud architectures. From preventing potential issues with AI hallucinations to analyzing hypervisors, cryptography, and distributed systems, having sound mathematical reasoning at our foundations and continuously analyzing what we build sets Amazon apart.

Learn more

Learn more about automated reasoning on the Amazon Science blog.
Understand how AWS applies automated reasoning to provide provable security.
If you’re interested in an internship with the AWS Automated Reasoning Group, let us know.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Author: Byron Cook