Enabling production-grade generative AI: New capabilities lower costs, streamline production, and boost security

As generative AI moves from proofs of concept (POCs) to production, we’re seeing a massive shift in how businesses and consumers interact with data, information, and each other. In what we consider “Act 1” of the generative AI story, we saw previously unimaginable amounts of data and compute create models that showcase the power of generative AI. Just last year, many businesses, and even more individuals, were focused on learning and experimenting, and the sheer number of POCs was impressive. Thousands of customers across diverse industries each ran anywhere from dozens to hundreds of experiments as they explored the potential and implications of generative AI applications.

By early 2024, we are beginning to see the start of “Act 2,” in which many POCs are evolving into production, delivering significant business value. To learn more about Act 1 and Act 2, refer to “Are we prepared for ‘Act 2’ of gen AI?” The move to a production mindset focuses new attention on key challenges as companies build and evaluate models on specific tasks and search for the leanest, fastest, and most cost-effective options. Considering, and reducing, the investment required for production workloads means bringing new efficiency to the sometimes complicated process of building, testing, and fine-tuning foundation models (FMs).

Delivering capabilities that increase efficiency and reduce costs

Offering customers multiple entry points to their generative AI journey is critical to delivering value as they move generative AI applications into production. Our generative AI technology stack provides the services and capabilities necessary to build and scale generative AI applications: Amazon Q (the most capable generative AI–powered assistant for accelerating software development) at the top layer, Amazon Bedrock (the easiest way to build and scale generative AI applications with foundation models) at the middle layer, and Amazon SageMaker (purpose-built to help you build, train, and deploy FMs) at the foundational, bottom layer. While these layers provide different points of entry, the fundamental truth is that every generative AI journey starts at the foundational bottom layer.

Organizations that want to build their own models or want granular control are choosing Amazon Web Services (AWS) because we are helping customers use the cloud more efficiently and leverage more powerful, price-performant AWS capabilities such as petabyte-scale networking capability, hyperscale clustering, and the right tools to help you build. Our deep investment in this layer enhances the capabilities and efficiency of the services we provide at higher layers.

To make generative AI use cases economical, you need to run your training and inference on incredibly high-performing, cost-effective infrastructure that’s purpose-built for AI. Amazon SageMaker makes it easy to optimize at each step of the model lifecycle, whether you are building, training, or deploying. However, FM training and inference present challenges, including operational burden, overall cost, and performance lag that contributes to a subpar user experience. State-of-the-art generative AI models average latencies on the order of seconds, and many of today’s massive models are too large to fit into a single instance.

In addition, the blistering pace of model optimization innovations leaves model builders with months of research to learn and implement these techniques, even before finalizing deployment configurations.

Introducing Amazon Elastic Kubernetes Service (Amazon EKS) in Amazon SageMaker HyperPod

Recognizing these challenges, AWS launched Amazon SageMaker HyperPod last year. Taking efficiency one step further, earlier this week, we announced the launch of Amazon EKS support on Amazon SageMaker HyperPod. Why? Because provisioning and managing the large GPU clusters needed for AI can pose a significant operational burden. And training runs that take weeks to complete are challenging, since a single failure can derail the entire process. Ensuring infrastructure stability and optimizing performance of distributed training workloads can also pose challenges.

Amazon SageMaker HyperPod provides a fully managed service that removes the operational burden and enables enterprises to accelerate FM development at an unprecedented scale. Now, support for Amazon EKS in Amazon SageMaker HyperPod makes it possible for builders to manage their SageMaker HyperPod clusters using Amazon EKS. Builders can use a familiar Kubernetes interface while eliminating the undifferentiated heavy lifting involved in setting up and optimizing these clusters for generative AI model development at scale. SageMaker HyperPod provides a highly resilient environment that automatically detects, diagnoses, and recovers from underlying infrastructure faults so that builders can train FMs for weeks or months at a time with minimal disruption.

Customer quote: Articul8 AI

“Amazon SageMaker HyperPod has helped us tremendously in managing and operating our computational resources more efficiently with minimum downtime. We were early adopters of the Slurm-based SageMaker HyperPod service and have benefitted from its ease-of-use and resiliency features, resulting in up to 35% productivity improvement and rapid scale up of our gen AI operations.

As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us because it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers because we are now able to package and productize this capability into our gen AI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”

– Arun Subramaniyan, Founder and CEO of Articul8 AI

Bringing new efficiency to inference

Even with the latest advancements in generative AI modeling, the inference phase remains a significant bottleneck. We believe that businesses creating customer- or consumer-facing generative AI applications shouldn’t have to sacrifice performance for cost-efficiency. They should be able to get both. That’s why two months ago, we released the inference optimization toolkit on Amazon SageMaker, a fully managed solution that provides the latest model optimization techniques, such as speculative decoding, compilation, and quantization. Available across SageMaker, this toolkit offers a simple menu of optimization techniques that can be used individually or together to create an “optimization recipe.” Because these techniques are easy to access and implement, customers can achieve up to ~2x higher throughput while reducing costs by ~50% for generative AI inference.
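To see why roughly doubling throughput lines up with roughly halving cost, a back-of-the-envelope calculation helps. The sketch below assumes a fixed hourly instance price and that cost scales only with instance time. The function name, price, and throughput figures are hypothetical placeholders for illustration, not AWS pricing or measured results.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Cost to generate one million tokens on an instance billed per hour."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
HOURLY_PRICE_USD = 10.0
BASELINE_TOKENS_PER_SECOND = 500.0

baseline_cost = cost_per_million_tokens(HOURLY_PRICE_USD, BASELINE_TOKENS_PER_SECOND)

# Same instance and price, but twice the throughput after optimization.
optimized_cost = cost_per_million_tokens(HOURLY_PRICE_USD, 2 * BASELINE_TOKENS_PER_SECOND)

savings_fraction = 1 - optimized_cost / baseline_cost

print(f"Baseline:  ${baseline_cost:.2f} per million tokens")
print(f"Optimized: ${optimized_cost:.2f} per million tokens")
print(f"Savings:   {savings_fraction:.0%}")
```

Under these assumptions the saving depends only on the throughput ratio, not on the particular price chosen: serving the same number of tokens takes half the instance time, so it costs half as much.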

Responsible model deployment that is safe and trustworthy

While cost and performance are critical issues, it’s important not to lose sight of other concerns that come to the forefront as we shift from POC to production. No matter what model you choose, it needs to be deployed in a safe, trustworthy, and responsible way. We all need to be able to unlock generative AI’s full potential while mitigating its risks. It should be easy to implement safeguards for your generative AI applications, customized to your requirements and responsible AI policies.

That’s why we built Amazon Bedrock Guardrails, a service that provides customizable safeguards so you can filter prompts and model responses. Guardrails can help block specific words or topics. Customers can also use Guardrails to help identify restricted content and prevent it from reaching end users.

We also have filters for harmful content and personally identifiable information (PII), along with security checks for malicious prompts, such as prompt injections. Recently, we also developed guardrails to help reduce hallucinations by checking that responses are grounded in the source material and relevant to the query.
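The kinds of checks described above can be illustrated with a small, self-contained sketch. This is not the Amazon Bedrock Guardrails API: the function name, the blocked-word list, and the regular expressions are all hypothetical and far simpler than a production safeguard. It only shows the general idea of blocking specific words and masking PII before text reaches an end user.

```python
import re
from typing import Tuple

# Hypothetical deny list; a real policy would be defined by the application owner.
BLOCKED_WORDS = {"confidential", "internal-only"}

# Deliberately simple patterns for illustration; real PII detection is much broader.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

BLOCKED_MESSAGE = "Sorry, this content cannot be shown."


def apply_simple_guardrail(text: str) -> Tuple[bool, str]:
    """Return (allowed, text): block on a denied word, otherwise mask PII."""
    words = set(re.findall(r"[a-z0-9-]+", text.lower()))
    if words & BLOCKED_WORDS:
        return False, BLOCKED_MESSAGE

    masked = EMAIL_PATTERN.sub("[EMAIL]", text)
    masked = US_PHONE_PATTERN.sub("[PHONE]", masked)
    return True, masked


print(apply_simple_guardrail("Contact jane@example.com or 206-555-0100."))
print(apply_simple_guardrail("This document is CONFIDENTIAL."))
```

The same function could be applied both to the incoming prompt and to the model response, which mirrors the idea of filtering in both directions.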

Delivering value with game-changing innovation

Our partnership with the NFL and our joint Next Gen Stats program offer impressive proof of how a production mindset is delivering true value not only to an organization but to people across the world. By using AWS AI tools and engineers, the NFL is taking tackle analysis to the next level, giving teams, broadcasters, and fans deeper insights into one of football’s most crucial skills—tackling. As fans know, tackling is a complex, evolving process that unfolds throughout each play. But traditional stats only tell part of the story. That’s why the NFL and AWS created Tackle Probability—a groundbreaking AI-powered metric that can identify a missed tackle, when and where that tackle attempt took place, and do it all in real time. For further detail, go to NFL on AWS.

Building this stat required 5 years of historical data to train an AI model on Amazon SageMaker capable of processing millions of data points per game, tracking 20 different features for each of the 11 defenders every tenth of a second. The result is a literally game-changing stat that provides unprecedented insights. Now the NFL can quantify tackling efficiency in ways never before possible. A defender can be credited with 15 tackle attempts in a game without a single miss, or we can measure how many missed tackles a running back forced. All told, there will be at least 10 new stats from this model.
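As a rough sense of scale, the tracking figures quoted above can be turned into a data rate. The sketch below uses only the numbers stated in this article (20 features, 11 defenders, one snapshot every tenth of a second); the variable names are illustrative, and the actual Next Gen Stats pipeline may track more than this simple count captures.

```python
FEATURES_PER_DEFENDER = 20   # from the article
DEFENDERS_ON_FIELD = 11      # from the article
SNAPSHOTS_PER_SECOND = 10    # one snapshot every tenth of a second

# Feature values captured in a single snapshot of the defense.
values_per_snapshot = FEATURES_PER_DEFENDER * DEFENDERS_ON_FIELD  # 220

# Feature values captured per second of tracked play.
values_per_second = values_per_snapshot * SNAPSHOTS_PER_SECOND  # 2200

# Seconds of tracked play needed to accumulate one million values.
seconds_for_one_million = 1_000_000 / values_per_second

print(f"{values_per_snapshot} values per snapshot")
print(f"{values_per_second} values per second")
print(f"about {seconds_for_one_million:.0f} seconds of play per million values")
```

At this rate, a few minutes of tracked play is enough to reach a million values, which is consistent with the article's description of millions of data points per game.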

For the NFL, coaches can now quantify tackling efficiency and identify players who consistently put themselves in the right position to make the play. And broadcasters can highlight broken or made tackles to fans in real time.

Building breakthroughs with AWS

The NFL is far from alone in using AWS to shift its focus from POC to production. Exciting startups like Evolutionary Scale are making it easy to generate new proteins and antibodies. Airtable is making it easier for their customers to use their data and build applications. And organizations like Slack are embedding generative AI into the workday. Fast-moving, successful startups are choosing AWS to build and accelerate their businesses. In fact, 96 percent of all AI/ML unicorns, and 90 percent of the 2024 Forbes AI 50, are AWS customers.

Why? Because we’re addressing the cost, performance, and security issues that stand between customers and production-grade generative AI applications. We’re empowering data scientists, ML engineers, and other builders with new capabilities that make generative AI development faster, easier, more secure, and less costly. We’re making FM building and tuning, along with a portfolio of intuitive tools that make it happen, available to more organizations as part of our ongoing commitment to the democratization of generative AI.

Fueling the next wave of innovation

Optimizing costs, boosting production efficiency, and ensuring security: these are among the top challenges as generative AI evolves from POC to production. We’re helping address these issues by adding innovative new capabilities to Amazon SageMaker, Amazon Bedrock, and beyond. And we’re lowering the barriers to entry by making these tools available to everyone, from large enterprises with ML teams to small businesses and individual developers just getting started. Empowering more people and organizations to experiment with generative AI creates an explosion of creative new use cases and applications. That’s exactly what we’re seeing as generative AI continues its rapid evolution from a fascinating technology to a day-to-day reality, improving experiences, inspiring innovation, boosting the competitive edge, and creating significant new value.

About the author

Baskar Sridharan is the Vice President for AI/ML and Data Services & Infrastructure, where he oversees the strategic direction and development of key services, including Bedrock, SageMaker, and essential data platforms like EMR, Athena, and Glue.

Prior to his current role, Baskar spent nearly six years at Google, where he contributed to advancements in cloud computing infrastructure. Before that, he dedicated 16 years to Microsoft, playing a pivotal role in the development of Azure Data Lake and Cosmos, which have significantly influenced the landscape of cloud storage and data management.

Baskar earned a Ph.D. in Computer Science from Purdue University and has since spent over two decades at the forefront of the tech industry.

He has lived in Seattle for over 20 years, where he, his wife, and two children embrace the beauty of the Pacific Northwest and its many outdoor activities. In his free time, Baskar enjoys practicing music and playing cricket and baseball with his kids.

Author: Baskar Sridharan