Operationalize generative AI applications on AWS: Part II – Architecture Deep Dive

In our previous blog, Operationalize generative AI applications on AWS: Part I, we covered the basics of how companies are using generative AI and Large Language Model Operations (LLMOps) on AWS. We discussed use cases for game companies using generative AI, ways to customize LLMs, and a high-level architecture for the LLMOps solution on AWS. In this blog, we will dive deeper into the architecture.

The architecture that we describe here is solution guidance for creating non-player characters (NPCs) for games using generative AI services. The application provides three choices for generating NPC dialog, using responses from (1) a base LLM; (2) a fine-tuned LLM; or (3) a game-specific, context-aware Retrieval Augmented Generation (RAG) pipeline. The application is then deployed using a continuous integration and continuous deployment (CI/CD) pipeline. The entire architecture is serverless, meaning there is no infrastructure to manage, all services scale automatically, and you only pay for what you use. Although we describe this in the context of games, the LLMOps architecture can be applied to operationalizing any other type of generative AI application.

We will break the architecture into four sections:

  1. The APIs used to invoke the text and image foundation models
  2. The code pipeline to build, test, and deploy architecture changes
  3. Continuous fine-tuning to create custom models
  4. RAG with automated vector database hydration

By the end of this blog, the architecture will look as follows:

The High-Level Architecture shows how the user interacts with the solution. It includes a base model invocation, a retrieval augmented generation invocation, a fine-tuning upload, and a vector database hydration upload. All of these flows are discussed in detail in this blog.

Figure 1: High level Architecture

Base Model Inference

In the base model inference flow, prompts are sent to Amazon API Gateway, which forwards the requests to AWS Lambda. The Lambda function sends the prompt to Amazon Bedrock using boto3.

Figure 2: Base Model Inference


The foundational aspect of any generative AI application is the model used to generate content and the APIs used to invoke it. This solution is built using foundation models hosted on Amazon Bedrock. These models are accessed by sending prompts through an API hosted on Amazon API Gateway, which triggers an AWS Lambda function. Each Lambda function is coded with a prompt template tailored to the model it is associated with, and the incoming prompt, combined with the prompt template, is used to invoke the model via the Bedrock invoke_model() SDK method.
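
For illustration, a minimal sketch of such a Lambda handler using boto3 might look like the following; the model ID, prompt template, and request and response fields are assumptions that depend on the foundation model you choose.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Assumed prompt template; in the solution each Lambda function carries a template
# tailored to its model and use case.
PROMPT_TEMPLATE = "You are a helpful in-game NPC. The player says: {player_input}"

def handler(event, context):
    body = json.loads(event["body"])  # request body forwarded by API Gateway
    prompt = PROMPT_TEMPLATE.format(player_input=body["prompt"])

    # Invoke a Bedrock text model; "anthropic.claude-v2" and its request/response
    # shape are illustrative assumptions.
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-v2",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "max_tokens_to_sample": 300,
        }),
    )
    completion = json.loads(response["body"].read())["completion"]
    return {"statusCode": 200, "body": json.dumps({"dialog": completion})}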

Amazon Bedrock supports text, image, and embedding models, allowing this architecture to be used for many different use cases.

Code Pipeline: High-Level

In the High-Level Code Pipeline architecture, AWS CodePipeline is used to start an execution that runs AWS CodeBuild. This pipeline is used to manage the base model inference flow.

Figure 3: High-Level Code Pipeline


Once the APIs are created, the next step is to wrap them in a CI/CD pipeline (AWS CodePipeline) to automate the continuous integration and continuous deployment of your application. At this point, models aren't being customized and the application only interacts with foundation models, so there is no continuous tuning aspect yet. While the diagram above changed only slightly, the following is what is actually being deployed.

Code Pipeline: Detailed

In the Detailed Code Pipeline architecture, users push code changes to AWS CodeCommit. This triggers a build stage in AWS CodeBuild, where AWS Cloud Development Kit (CDK) code is synthesized into an AWS CloudFormation template and deployed into a QA account. The deployment is then passed to another AWS CodeBuild stage that runs system tests. If these pass, the CloudFormation template is deployed into the production account.

Figure 4: Detailed Code Pipeline


An AWS CodeCommit repository is created to store the application code. Whenever source code changes are made, the pipeline is re-executed. At each stage, the pipeline checks whether the changes affect the current stage and, if they do, automatically makes the necessary changes; this is the self-mutating piece of the pipeline. The pipeline builds and tests the application, deploys it into a QA environment, runs system tests on the deployment, and, if they succeed, deploys the infrastructure changes into production. Deploying into QA first allows you to catch potential issues before deploying to production.
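
As a rough sketch, a self-mutating pipeline of this shape can be declared with the CDK pipelines module as shown below; the repository name, stage names, synth commands, and test command are illustrative assumptions, and ApiStack is a placeholder for the API stack described earlier.

from aws_cdk import Stack, Stage, pipelines
from aws_cdk import aws_codecommit as codecommit
from constructs import Construct

class ApiStack(Stack):
    # Placeholder for the API Gateway / Lambda / Bedrock stack described earlier.
    pass

class AppStage(Stage):
    # A deployable copy of the application, reused for the QA and production accounts.
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        ApiStack(self, "GenAiApi")

class PipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        repo = codecommit.Repository.from_repository_name(self, "Repo", "llmops-app")

        pipeline = pipelines.CodePipeline(
            self, "Pipeline",
            self_mutation=True,  # the pipeline updates itself when its own definition changes
            synth=pipelines.ShellStep(
                "Synth",
                input=pipelines.CodePipelineSource.code_commit(repo, "main"),
                commands=["npm install -g aws-cdk",
                          "pip install -r requirements.txt",
                          "cdk synth"],
            ),
        )

        # Deploy to QA, run system tests, and only then deploy to production.
        qa = pipeline.add_stage(AppStage(self, "QA"))
        qa.add_post(pipelines.ShellStep("SystemTests", commands=["pytest tests/system"]))
        pipeline.add_stage(AppStage(self, "Prod"))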

Continuous Fine-Tuning

The High-Level Continuous Tuning flow shows how you can upload data to Amazon S3 and automatically trigger a fine-tuning job in Amazon Bedrock.

Figure 5: High-Level Continuous Tuning


Now that the APIs and models can be tested in a QA environment before deploying into production, it is safe to start fine-tuning models, if needed, to create custom models. As discussed in the first blog post, continuous tuning consists of data preprocessing, model tuning, model evaluation, and model registration. In this solution, these four steps are automated with an Amazon SageMaker Pipeline and an AWS Step Functions workflow.

The Automated Fine-Tuning Pipeline diagram shows how, when data is uploaded to Amazon S3, an AWS Lambda function triggers a SageMaker Pipeline. The pipeline preprocesses the data before starting a fine-tuning job in Amazon Bedrock. Once the model has completed fine-tuning, it is evaluated and stored in a model registry. Once the model is approved, an event is sent to Amazon EventBridge, which triggers AWS CodePipeline to update the infrastructure to use the new model.

Figure 6: Automated Fine-Tuning Pipeline


When data is uploaded to the designated Amazon Simple Storage Service (Amazon S3) bucket, a Lambda function is triggered, which kicks off a SageMaker Pipeline. The SageMaker Pipeline automates data preprocessing, model fine-tuning, model evaluation, and model registration. Each step in the pipeline can be an Amazon SageMaker Processing Job, but as of writing this blog (April 2024), SageMaker Pipelines does not natively support fine-tuning jobs in Amazon Bedrock. To work around this, a callback step is used to send a message to an Amazon Simple Queue Service (Amazon SQS) queue, which starts a Step Functions workflow.
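
For illustration, a callback step of this kind can be declared with the SageMaker Python SDK roughly as follows; the queue URL, base model ID, and training data location are assumptions, and the preprocessing, evaluation, and registration steps sit around it in the same pipeline.

from sagemaker.workflow.callback_step import (
    CallbackOutput,
    CallbackOutputTypeEnum,
    CallbackStep,
)

# Output value that the Step Functions workflow reports back through the callback token.
fine_tune_status = CallbackOutput(
    output_name="fine_tune_status",
    output_type=CallbackOutputTypeEnum.String,
)

fine_tune_step = CallbackStep(
    name="FineTuneOnBedrock",
    # Assumed queue; the Step Functions workflow in Figure 7 consumes messages from it.
    sqs_queue_url="https://sqs.us-east-1.amazonaws.com/111122223333/fine-tune-callback-queue",
    inputs={
        "base_model_id": "amazon.titan-text-express-v1",  # assumed base model to customize
        "training_data_s3_uri": "s3://llmops-tuning-data/processed/train.jsonl",  # assumed path
    },
    outputs=[fine_tune_status],
)

# fine_tune_step is placed in the SageMaker Pipeline between the preprocessing step
# and the evaluation and registration steps described below.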

The Fine-Tuning Call Back Step shows how we orchestrate the fine-tuning job in Amazon Bedrock using AWS Step Functions. This Step Functions workflow starts the fine-tuning job, waits for it to either complete or fail, and then returns the result to the SageMaker Pipeline.

Figure 7: Fine-Tuning Call Back Step Function


This Step Functions workflow uses Lambda functions to start the fine-tuning job on Amazon Bedrock, wait 10 minutes, and then check the progress of the job; if the job is incomplete, it waits another 10 minutes and repeats the check until the job finishes. Once the tuning job completes, a success callback is sent back to the SageMaker Pipeline.
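
A simplified sketch of the status-check Lambda function inside this loop is shown below; the event field names are assumptions about how the state machine passes along the Bedrock job ARN and the SageMaker callback token.

import boto3

bedrock = boto3.client("bedrock")
sagemaker = boto3.client("sagemaker")

def handler(event, context):
    # Look up the fine-tuning (model customization) job started earlier in the workflow.
    job = bedrock.get_model_customization_job(jobIdentifier=event["job_arn"])
    status = job["status"]  # InProgress | Completed | Failed | Stopping | Stopped

    if status == "Completed":
        # Report success back to the waiting SageMaker Pipeline callback step.
        sagemaker.send_pipeline_execution_step_success(
            CallbackToken=event["callback_token"],
            OutputParameters=[{"Name": "fine_tune_status", "Value": status}],
        )
    elif status in ("Failed", "Stopped"):
        sagemaker.send_pipeline_execution_step_failure(
            CallbackToken=event["callback_token"],
            FailureReason=job.get("failureMessage", status),
        )
    # Otherwise the state machine waits another 10 minutes and invokes this function again.
    return {"status": status}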

The data preprocessing and model evaluation steps run SageMaker Processing Jobs, which allow you to run a script on a processing container fully managed by Amazon SageMaker and provisioned only for the duration of the job. The final step is a RegisterModel step within SageMaker Pipelines, which creates a model package and registers it in SageMaker Model Registry. A model package is a reusable abstraction of model artifacts that packages everything necessary for inference.

At this point, the new model version is stored in SageMaker Model Registry and the fine-tuned, custom model is available in Bedrock. Once the model is provisioned for use via Bedrock, the code pipeline is triggered to use this new model. To run inference against a custom model in Amazon Bedrock, provisioned throughput must be associated with the custom model. To make the association, purchase provisioned throughput on the Bedrock page in the AWS Management Console and select your custom model.

Once the provisioned throughput is InService, the model can be approved within the Model Registry. Upon approval, an Amazon EventBridge event triggers a Lambda function, which updates a parameter in AWS Systems Manager Parameter Store and starts an execution of the code pipeline. The pipeline detects the change in the Parameter Store parameter and updates the application to use the new model. Assuming all tests pass, the application is automatically deployed into production using the new custom model.
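
To make this concrete, the approval-handling Lambda function could look roughly like the sketch below; the Parameter Store name, the pipeline name, and the exact value stored are illustrative assumptions about how the pieces are wired together.

import boto3

ssm = boto3.client("ssm")
codepipeline = boto3.client("codepipeline")

def handler(event, context):
    # The EventBridge "SageMaker Model Package State Change" event carries the package ARN.
    # In practice you would store whatever identifier the application stack reads to pick
    # up the new custom model; the package ARN is used here purely for illustration.
    model_package_arn = event["detail"]["ModelPackageArn"]

    ssm.put_parameter(
        Name="/llmops/custom-model-arn",  # assumed parameter read by the application stack
        Value=model_package_arn,
        Type="String",
        Overwrite=True,
    )

    # Re-run the CI/CD pipeline so the change is rolled out through QA and production.
    codepipeline.start_pipeline_execution(name="llmops-cdk-pipeline")  # assumed pipeline name
    return {"updated_model": model_package_arn}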

Retrieval Augmented Generation (RAG)

The Retrieval Augmented Generation diagram shows how an Amazon OpenSearch Service domain is added as a vector database. Information retrieved from it is used by the model as additional context.

Figure 8: Retrieval Augmented Generation


As an alternative way to add context and additional information to the model without fine-tuning, we can utilize RAG. RAG consists of two main parts: (1) converting the data into vector embeddings and storing them in a vector database, which we call hydration; and (2) adding context from the vector database to incoming prompts.

Database hydration

In the automated database hydration pipeline, data is uploaded to Amazon S3. This triggers an AWS Lambda function that starts a SageMaker Processing Job, which chunks the data and generates embeddings for the chunks using an embeddings model on Amazon Bedrock. The chunks and embeddings are then stored in Amazon OpenSearch Service.

Figure 9: Automated database hydration


Hydrating the vector database consists of processing your documents or data into chunks, creating embeddings for the chunks using an embeddings model, and storing these embeddings along with the data in a vector database. We automate this process by starting a SageMaker Processing Job whenever new data is uploaded to S3. The processing job chunks the data, creates embeddings for the chunks using an embeddings model hosted on Amazon Bedrock, and stores the output in an Amazon OpenSearch Service domain, which is used as the vector database.
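
The hydration logic inside the processing job could look roughly like the following sketch; the chunking strategy, index name, embeddings model ID, and OpenSearch domain endpoint are all illustrative assumptions.

import json

import boto3
from opensearchpy import OpenSearch

bedrock_runtime = boto3.client("bedrock-runtime")
opensearch = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],  # assumed endpoint
    use_ssl=True,  # add authentication options as required for your domain
)

def embed(text: str) -> list:
    # Assumed embeddings model; any Bedrock embeddings model would follow the same pattern.
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def chunk(text: str, size: int = 1000) -> list:
    # Naive fixed-size chunking; real pipelines often split on sentences or tokens.
    return [text[i:i + size] for i in range(0, len(text), size)]

def hydrate(document_text: str, index: str = "npc-game-lore"):
    for i, piece in enumerate(chunk(document_text)):
        opensearch.index(
            index=index,
            id=str(i),
            # The index is assumed to define "embedding" as a knn_vector field.
            body={"text": piece, "embedding": embed(piece)},
        )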

Adding context

When a prompt is sent to the model, the prompt is converted to vector embeddings and a similarity search is run against the vector database. The search returns the top k most relevant documents, which are then added to the prompt and passed to the LLM. We use a Lambda function to orchestrate this RAG pipeline.
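
A simplified sketch of that orchestration Lambda function follows; the index name, field names, number of retrieved documents, and model IDs are illustrative assumptions that match the hydration sketch above.

import json

import boto3
from opensearchpy import OpenSearch

bedrock_runtime = boto3.client("bedrock-runtime")
opensearch = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],  # assumed endpoint
    use_ssl=True,
)

def handler(event, context):
    prompt = json.loads(event["body"])["prompt"]

    # 1. Embed the incoming prompt with the same embeddings model used for hydration.
    embedding = json.loads(bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": prompt}),
    )["body"].read())["embedding"]

    # 2. Retrieve the k most similar chunks from the vector database.
    hits = opensearch.search(index="npc-game-lore", body={
        "size": 3,
        "query": {"knn": {"embedding": {"vector": embedding, "k": 3}}},
    })["hits"]["hits"]
    context_text = "\n".join(hit["_source"]["text"] for hit in hits)

    # 3. Add the retrieved context to the prompt and invoke the text model.
    augmented = f"Use the following game lore as context.\n{context_text}\n\nPlayer: {prompt}"
    completion = json.loads(bedrock_runtime.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({
            "prompt": f"\n\nHuman: {augmented}\n\nAssistant:",
            "max_tokens_to_sample": 300,
        }),
    )["body"].read())["completion"]

    return {"statusCode": 200, "body": json.dumps({"dialog": completion})}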

Conclusion

In this blog, we described an LLMOps architecture in the context of a generative AI application built using native AWS services. The generative AI application uses Amazon Bedrock models to create dialog for NPCs. We described how to automate fine-tuning of models and how to automate the RAG pipeline that augments incoming prompts. We also showed you how to wrap the entire application in a CI/CD pipeline. In the next and final post in this three-part series, we will dive into the code used to deploy this solution. For a step-by-step walkthrough of this solution, refer to the following workshop: Operationalize Generative AI Applications on AWS.

Author: Andrew Walko