Create an HCLS document summarization application with Falcon using Amazon SageMaker JumpStart

TutoSartup excerpt from this article:
With unique data formats and strict regulatory requirements, customers are looking for choices to select the most performant and cost-effective model, as well as the ability to perform necessary customization (fine-tuning) to fit their business use case… In this post, we walk you through deployin…

Healthcare and life sciences (HCLS) customers are adopting generative AI as a tool to get more from their data. Use cases include document summarization to help readers focus on key points of a document and transforming unstructured text into standardized formats to highlight important attributes. With unique data formats and strict regulatory requirements, customers are looking for choices to select the most performant and cost-effective model, as well as the ability to perform necessary customization (fine-tuning) to fit their business use case. In this post, we walk you through deploying a Falcon large language model (LLM) using Amazon SageMaker JumpStart and using the model to summarize long documents with LangChain and Python.

Solution overview

Amazon SageMaker is built on Amazon’s two decades of experience developing real-world ML applications, including product recommendations, personalization, intelligent shopping, robotics, and voice-assisted devices. SageMaker is a HIPAA-eligible managed service that provides tools that enable data scientists, ML engineers, and business analysts to innovate with ML. Within SageMaker is Amazon SageMaker Studio, an integrated development environment (IDE) purpose-built for collaborative ML workflows, which, in turn, contain a wide variety of quickstart solutions and pre-trained ML models in an integrated hub called SageMaker JumpStart. With SageMaker JumpStart, you can use pre-trained models, such as the Falcon LLM, with pre-built sample notebooks and SDK support to experiment with and deploy these powerful transformer models. You can use SageMaker Studio and SageMaker JumpStart to deploy and query your own generative model in your AWS account.

You can also ensure that the inference payload data doesn’t leave your VPC. You can provision models as single-tenant endpoints and deploy them with network isolation. Furthermore, you can curate and manage the selected set of models that satisfy your own security requirements by using the private model hub capability within SageMaker JumpStart and storing the approved models in there. SageMaker is in scope for HIPAA BAA, SOC123, and HITRUST CSF.

The Falcon LLM is a large language model, trained by researchers at Technology Innovation Institute (TII) on over 1 trillion tokens using AWS. Falcon has many different variations, with its two main constituents Falcon 40B and Falcon 7B, comprised of 40 billion and 7 billion parameters, respectively, with fine-tuned versions trained for specific tasks, such as following instructions. Falcon performs well on a variety of tasks, including text summarization, sentiment analysis, question answering, and conversing. This post provides a walkthrough that you can follow to deploy the Falcon LLM into your AWS account, using a managed notebook instance through SageMaker JumpStart to experiment with text summarization.

The SageMaker JumpStart model hub includes complete notebooks to deploy and query each model. As of this writing, there are six versions of Falcon available in the SageMaker JumpStart model hub: Falcon 40B Instruct BF16, Falcon 40B BF16, Falcon 180B BF16, Falcon 180B Chat BF16, Falcon 7B Instruct BF16, and Falcon 7B BF16. This post uses the Falcon 7B Instruct model.

In the following sections, we show how to get started with document summarization by deploying Falcon 7B on SageMaker Jumpstart.

Prerequisites

For this tutorial, you’ll need an AWS account with a SageMaker domain. If you don’t already have a SageMaker domain, refer to Onboard to Amazon SageMaker Domain to create one.

Deploy Falcon 7B using SageMaker JumpStart

To deploy your model, complete the following steps:

Navigate to your SageMaker Studio environment from the SageMaker console.
Within the IDE, under SageMaker JumpStart in the navigation pane, choose Models, notebooks, solutions.
Deploy the Falcon 7B Instruct model to an endpoint for inference.

This will open the model card for the Falcon 7B Instruct BF16 model. On this page, you can find the Deploy or Train options as well as links to open the sample notebooks in SageMaker Studio. This post will use the sample notebook from SageMaker JumpStart to deploy the model.

Choose Open notebook.

Run the first four cells of the notebook to deploy the Falcon 7B Instruct endpoint.

You can see your deployed JumpStart models on the Launched JumpStart assets page.

In the navigation pane, under SageMaker Jumpstart, choose Launched JumpStart assets.
Choose the Model endpoints tab to view the status of your endpoint.

With the Falcon LLM endpoint deployed, you are ready to query the model.

Run your first query

To run a query, complete the following steps:

On the File menu, choose New and Notebook to open a new notebook.

You can also download the completed notebook here.

Select the image, kernel, and instance type when prompted. For this post, we choose the Data Science 3.0 image, Python 3 kernel, and ml.t3.medium instance.

Import the Boto3 and JSON modules by entering the following two lines into the first cell:

import json
import boto3

Press Shift + Enter to run the cell.
Next, you can define a function that will call your endpoint. This function takes a dictionary payload and uses it to invoke the SageMaker runtime client. Then it deserializes the response and prints the input and generated text.

newline, bold, unbold = 'n', '33[1m', '33[0m'
endpoint_name = 'ENDPOINT_NAME'

def query_endpoint(payload):
    client = boto3.client('runtime.sagemaker')
    response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=json.dumps(payload).encode('utf-8'))
    model_predictions = json.loads(response['Body'].read())
    generated_text = model_predictions[0]['generated_text']
    print (
        f"Input Text: {payload['inputs']}{newline}"
        f"Generated Text: {bold}{generated_text}{unbold}{newline}")

The payload includes the prompt as inputs, together with the inference parameters that will be passed to the model.

You can use these parameters with the prompt to tune the output of the model for your use case:

payload = {
    "inputs": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.nDaniel: Hello, Girafatron!nGirafatron:",
    "parameters":{
        "max_new_tokens": 50,
        "return_full_text": False,
        "do_sample": True,
        "top_k":10
        }
}

Query with a summarization prompt

This post uses a sample research paper to demonstrate summarization. The example text file is concerning automatic text summarization in biomedical literature. Complete the following steps:

Download the PDF and copy the text into a file named document.txt.
In SageMaker Studio, choose the upload icon and upload the file to your SageMaker Studio instance.

Out of the box, the Falcon LLM provides support for text summarization.

Let’s create a function that uses prompt engineering techniques to summarize document.txt:

def summarize(text_to_summarize):
    summarization_prompt = """Process the following text and then perform the instructions that follow:

{text_to_summarize}

Provide a short summary of the preceeding text.

Summary:"""
    payload = {
        "inputs": summarization_prompt,
        "parameters":{
            "max_new_tokens": 150,
            "return_full_text": False,
            "do_sample": True,
            "top_k":10
            }
    }
    response = query_endpoint(payload)
    print(response)
    
with open("document.txt") as f:
    text_to_summarize = f.read()

summarize(text_to_summarize)

You’ll notice that for longer documents, an error appears—Falcon, alongside all other LLMs, has a limit on the number of tokens passed as input. We can get around this limit using LangChain’s enhanced summarization capabilities, which allows for a much larger input to be passed to the LLM.

Import and run a summarization chain

LangChain is an open-source software library that allows developers and data scientists to quickly build, tune, and deploy custom generative applications without managing complex ML interactions, commonly used to abstract many of the common use cases for generative AI language models in just a few lines of code. LangChain’s support for AWS services includes support for SageMaker endpoints.

LangChain provides an accessible interface to LLMs. Its features include tools for prompt templating and prompt chaining. These chains can be used to summarize text documents that are longer than what the language model supports in a single call. You can use a map-reduce strategy to summarize long documents by breaking it down into manageable chunks, summarizing them, and combining them (and summarized again, if needed).

Let’s install LangChain to begin:

%pip install langchain

Import the relevant modules and break down the long document into chunks:

import langchain
from langchain import SagemakerEndpoint, PromptTemplate
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

text_splitter = RecursiveCharacterTextSplitter(
                    chunk_size = 500,
                    chunk_overlap  = 20,
                    separators = [" "],
                    length_function = len
                )
input_documents = text_splitter.create_documents([text_to_summarize])

To make LangChain work effectively with Falcon, you need to define the default content handler classes for valid input and output:

class ContentHandlerTextSummarization(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> json:
        response_json = json.loads(output.read().decode("utf-8"))
        generated_text = response_json[0]['generated_text']
        return generated_text.split("summary:")[-1]
    
content_handler = ContentHandlerTextSummarization()

You can define custom prompts as PromptTemplate objects, the main vehicle for prompting with LangChain, for the map-reduce summarization approach. This is an optional step because mapping and combine prompts are provided by default if the parameters within the call to load the summarization chain (load_summarize_chain) are undefined.

map_prompt = """Write a concise summary of this text in a few complete sentences:

{text}

Concise summary:"""

map_prompt_template = PromptTemplate(
                        template=map_prompt, 
                        input_variables=["text"]
                      )


combine_prompt = """Combine all these following summaries and generate a final summary of them in a few complete sentences:

{text}

Final summary:"""

combine_prompt_template = PromptTemplate(
                            template=combine_prompt, 
                            input_variables=["text"]
                          )

LangChain supports LLMs hosted on SageMaker inference endpoints, so instead of using the AWS Python SDK, you can initialize the connection through LangChain for greater accessibility:

summary_model = SagemakerEndpoint(
                    endpoint_name = endpoint_name,
                    region_name= "us-east-1",
                    model_kwargs= {},
                    content_handler=content_handler
                )

Finally, you can load in a summarization chain and run a summary on the input documents using the following code:

summary_chain = load_summarize_chain(llm=summary_model,
                                     chain_type="map_reduce", 
                                     map_prompt=map_prompt_template,
                                     combine_prompt=combine_prompt_template,
                                     verbose=True
                                    ) 
summary = summary_chain({"input_documents": input_documents, 'token_max': 700}, return_only_outputs=True)
print(summary["output_text"])

Because the verbose parameter is set to True, you’ll see all of the intermediate outputs of the map-reduce approach. This is useful for following the sequence of events to arrive at a final summary. With this map-reduce approach, you can effectively summarize documents much longer than is normally allowed by the model’s maximum input token limit.

Clean up

After you’ve finished using the inference endpoint, it’s important to delete it to avoid incurring unnecessary costs through the following lines of code:

client = boto3.client('runtime.sagemaker')
client.delete_endpoint(EndpointName=endpoint_name)

Using other foundation models in SageMaker JumpStart

Utilizing other foundation models available in SageMaker JumpStart for document summarization requires minimal overhead to set up and deploy. LLMs occasionally vary with the structure of input and output formats, and as new models and pre-made solutions are added to SageMaker JumpStart, depending on the task implementation, you may have to make the following code changes:

If you are performing summarization via the summarize() method (the method without using LangChain), you may have to change the JSON structure of the payload parameter, as well as the handling of the response variable in the query_endpoint() function
If you are performing summarization via LangChain’s load_summarize_chain() method, you may have to modify the ContentHandlerTextSummarization class, specifically the transform_input() and transform_output() functions, to correctly handle the payload that the LLM expects and the output the LLM returns

Foundation models vary not only in factors such as inference speed and quality, but also input and output formats. Refer to the LLM’s relevant information page on expected input and output.

Conclusion

The Falcon 7B Instruct model is available on the SageMaker JumpStart model hub and performs on a number of use cases. This post demonstrated how you can deploy your own Falcon LLM endpoint into your environment using SageMaker JumpStart and do your first experiments from SageMaker Studio, allowing you to rapidly prototype your models and seamlessly transition to a production environment. With Falcon and LangChain, you can effectively summarize long-form healthcare and life sciences documents at scale.

For more information on working with generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS. You can start experimenting and building document summarization proofs of concept for your healthcare and life science-oriented GenAI applications using the method outlined in this post. When Amazon Bedrock is generally available, we will publish a follow-up post showing how you can implement document summarization using Amazon Bedrock and LangChain.

About the Authors

John Kitaoka is a Solutions Architect at Amazon Web Services. John helps customers design and optimize AI/ML workloads on AWS to help them achieve their business goals.

Josh Famestad is a Solutions Architect at Amazon Web Services. Josh works with public sector customers to build and execute cloud based approaches to deliver on business priorities.

Create an HCLS document summarization application with Falcon using Amazon SageMaker JumpStart
Author: John Kitaoka