Text embedding and sentence similarity retrieval at scale with Amazon SageMaker JumpStart
Text vectors or embeddings are numerical vector representations of text that are generated by large language models (LLMs). After LLMs are fully pre-trained on a large dataset or fine-tuned on different tasks, including text completion, question answering, and translation, their text embeddings capture the semantic information of the input text. Text embeddings make many downstream applications possible, including similarity search, information retrieval, recommendations and personalization, multilingual translation, and more.
Before intelligent applications can be built from embeddings, enterprises and organizations have to embed their existing documents, which can be expensive and technically complicated. Amazon SageMaker JumpStart is a machine learning (ML) hub that helps accelerate this journey. With SageMaker JumpStart, you can access pre-trained, cutting-edge text embedding models from various model providers, including Hugging Face, AI21 Labs, Cohere, and Meta AI. You can seamlessly deploy these models into production with the SageMaker JumpStart user interface or SDK. In addition, none of your data is used to train the underlying models. Because all data is encrypted and doesn’t leave your VPC, you can trust that your data remains private and confidential.
In this post, we demonstrate how to use the SageMaker Python SDK for text embedding and sentence similarity. Sentence similarity involves assessing the likeness between two pieces of text after the LLM converts them into embeddings, and it is a foundational step for applications like Retrieval Augmented Generation (RAG). We demonstrate how to do the following:
- Run inference on a text embedding model deployed from SageMaker JumpStart
- Find the nearest neighbors for an input sentence with your own dataset
- Run the batch transform on large documents to minimize costs
All the code is available on GitHub.
Deploy a text embedding model via SageMaker JumpStart
To host a model on Amazon SageMaker, the first step is to set up and authenticate the use of AWS services. In Amazon SageMaker Studio, we use the execution role associated with the notebook instance. See the following code:
On Hugging Face, the Massive Text Embedding Benchmark (MTEB) is provided as a leaderboard for diverse text embedding tasks. It currently provides 129 benchmarking datasets across 8 different tasks, covering 113 languages. The top text embedding models from the MTEB leaderboard are available from SageMaker JumpStart, including `bge`, `gte`, `e5`, and more. In this post, we use `huggingface-sentencesimilarity-bge-large-en` as an example. We can use the SageMaker SDK to deploy this state-of-the-art text embedding model:
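The original code listing is not reproduced here, so the following is only a minimal sketch of what the deployment could look like with the SageMaker Python SDK's `JumpStartModel` class. The helper name `deploy_embedding_model` is hypothetical, and the sketch assumes it runs where `sagemaker.get_execution_role()` can resolve a role (for example, SageMaker Studio). Calling it creates a billable endpoint.

```python
MODEL_ID = "huggingface-sentencesimilarity-bge-large-en"  # model ID named in this post


def deploy_embedding_model(model_id=MODEL_ID, instance_type=None):
    """Deploy a JumpStart text embedding model and return its predictor.

    Hypothetical helper: calling it creates a real, billable endpoint.
    """
    # Imported lazily so this module loads even without the SageMaker SDK.
    import sagemaker
    from sagemaker.jumpstart.model import JumpStartModel

    # In SageMaker Studio this resolves the notebook's execution role.
    role = sagemaker.get_execution_role()

    # When instance_type is None, the SDK falls back to the model's default.
    model = JumpStartModel(model_id=model_id, role=role, instance_type=instance_type)
    return model.deploy()
```

For example, `predictor = deploy_embedding_model(instance_type="ml.g5.2xlarge")` would target one of the instance types benchmarked later in this post.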
Text embedding model query
Let’s look at the text embedding model query in more detail.
Text to embedding
If you have already deployed a SageMaker endpoint, the `predictor` can be restored as follows:
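As a hedged sketch, one way to reattach to an existing endpoint is the SDK's generic `Predictor` class. The helper name `restore_predictor` is hypothetical, and the JSON serializer and deserializer are an assumption based on the JSON payloads used in this post.

```python
def restore_predictor(endpoint_name):
    """Return a predictor bound to an already deployed endpoint (hypothetical helper)."""
    # Imported lazily so this module loads even without the SageMaker SDK.
    from sagemaker.deserializers import JSONDeserializer
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer

    # Assumption: the endpoint accepts and returns JSON.
    return Predictor(
        endpoint_name=endpoint_name,
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )
```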
After the model is successfully deployed, you can query the endpoint with a batch of input texts within a JSON payload:
The correlation of the embeddings of these sentences is plotted in the following figure.
As shown in the preceding figure, sentences on the same subject are highly correlated with each other, including `Pets`, `Cities`, and `Color`; sentences on different subjects are much less similar. This indicates that the embeddings generated by the LLM (in this case, `bge`) represent the semantic information accurately.
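To make the comparison concrete, the following sketch computes a pairwise similarity matrix with cosine similarity, one common measure for embeddings (the post does not state which measure the figure uses). The three-dimensional vectors are made-up toy values, not output from `bge`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise cosine similarity for a list of embedding vectors."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


# Toy vectors for illustration only; real embeddings have far more dimensions.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "Paris": [0.1, 0.9, 0.2],
}
matrix = similarity_matrix(list(toy_embeddings.values()))
```

In this toy data, the two pet vectors score much closer to each other than either does to the city vector, which mirrors the block structure described for the figure.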
For this post, we used the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. Latency is the amount of time from the moment that a user sends a request until the time that the application indicates that the request has been completed. The numbers in the following table represent the average latency for a total of 100 requests using the same batch of input texts on the `ml.g5.2xlarge` and `ml.c6i.xlarge` instances.
Model | g5.2xlarge Average Latency (ms) | c6i.xlarge Average Latency (ms) | Language Support |
---|---|---|---|
all-MiniLM-L6-v2 | 19.5 | 27.9 | English |
BGE Base En | 21.2 | 114 | English |
BGE Small En | 28.3 | 45.6 | English |
BGE Large En | 34.7 | 337 | English |
Multilingual E5 Base | 22.1 | 118 | Multilingual |
Multilingual E5 Large | 39.8 | 360 | Multilingual |
E5 Base | 25.6 | 117 | English |
E5 Base V2 | 25.2 | 123 | English |
E5 Large | 32.2 | 339 | English |
E5 Large V2 | 32.5 | 331 | English |
GTE Base | 22.2 | 112 | English |
GTE Small | 19.7 | 46 | English |
GTE Large | 39.7 | 347 | English |
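
The averaging described above can be sketched as follows. This is not the benchmark harness used for the table; `average_latency_ms` is a hypothetical helper that times any zero-argument callable, and the `fake_request` stub stands in for a real endpoint call.

```python
import statistics
import time


def average_latency_ms(send_request, num_requests=100):
    """Call send_request num_requests times and return the mean latency in ms."""
    if num_requests < 1:
        raise ValueError("num_requests must be at least 1")
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        send_request()  # in practice, a call that sends the fixed payload
        latencies.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(latencies)


def fake_request():
    """Stand-in for an endpoint call; sleeps briefly instead of using the network."""
    time.sleep(0.001)


mean_ms = average_latency_ms(fake_request, num_requests=10)
```

Measured this way, the number includes client-side serialization and network time, so results depend on where the client runs relative to the endpoint.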
Get the nearest neighbors
The model deployed from SageMaker JumpStart can also identify the nearest neighbors to queries within a corpus. When provided with queries and a corpus, the model returns the `corpus_id`, which denotes the position of the relevant corpus entry in the input corpus list, and a score indicating the degree of proximity to the query. It uses the following parameters:
- corpus – Provides the list of inputs from which to find the nearest neighbor
- queries – Provides the list of inputs for which to find the nearest neighbor from the corpus
- top_k – The number of nearest neighbors to find from the corpus
- mode – Set as `nn_corpus` for getting the nearest neighbors to input queries within the corpus
See the following code:
We get the following output:
This result means the first query is most similar to the first corpus entry, the second query is closest to the second corpus entry, and so on. This is a correct match in this example.
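The request described above can be sketched as follows, using the four parameters listed earlier. The helper names are hypothetical, and the response shape assumed by `matched_texts` (one list of `corpus_id` and `score` entries per query) is an assumption about nesting; the post only states that those two fields are returned.

```python
import json


def build_nn_payload(corpus, queries, top_k=1):
    """Assemble a nearest-neighbor request from the parameters listed above."""
    return {"corpus": corpus, "queries": queries, "top_k": top_k, "mode": "nn_corpus"}


def encode_payload(payload):
    """Serialize the payload to UTF-8 JSON bytes for the request body."""
    return json.dumps(payload).encode("utf-8")


def matched_texts(corpus, neighbors):
    """Map each corpus_id back to its corpus text.

    Assumed (hypothetical) shape: neighbors[i] is a list of
    {"corpus_id": int, "score": float} dicts for query i.
    """
    return [
        [(corpus[hit["corpus_id"]], hit["score"]) for hit in hits]
        for hits in neighbors
    ]
```

Because `corpus_id` is a position in the input list, the corpus order sent in the request has to be kept on the client side to interpret the results.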
We also took the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. The numbers in the following table represent the average latency for a total of 100 requests using the same payload on the `ml.g5.2xlarge` and `ml.c6i.xlarge` instances.
Model | g5.2xlarge Average Latency (ms) | c6i.xlarge Average Latency (ms) | Language Support |
---|---|---|---|
all-MiniLM-L6-v2 | 21.7 | 69.1 | English |
BGE Base En | 29.1 | 372 | English |
BGE Small En | 29.2 | 124 | English |
BGE Large En | 47.2 | 1240 | English |
Multilingual E5 Base | 30 | 389 | Multilingual |
Multilingual E5 Large | 47.1 | 1380 | Multilingual |
E5 Base | 30.4 | 373 | English |
E5 Base V2 | 31 | 409 | English |
E5 Large | 45.9 | 1230 | English |
E5 Large V2 | 49.6 | 1220 | English |
GTE Base | 30.3 | 375 | English |
GTE Small | 28.5 | 129 | English |
GTE Large | 46.6 | 1320 | English |
Get the nearest neighbors on a large dataset
When making requests to the SageMaker invoke endpoint, payloads are restricted to approximately 5 MB, and the request timeout is set to 1 minute. If the corpus size exceeds these limits, you can use a SageMaker training job, which generates embeddings for your large dataset and persists them alongside the model inside the SageMaker endpoint. Therefore, they don’t have to be passed as part of the invocation payload. The process of finding the nearest neighbors is carried out using SentenceTransformer and its utility function. The nearest neighbor is based on the cosine similarity between the input sentence embedding and the sentence embeddings precomputed during the training job.
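As a rough way to decide between the two approaches, the following sketch checks whether a JSON payload fits under the size limit quoted above. The helper names are hypothetical, and treating "approximately 5 MB" as 5 × 1024 × 1024 bytes is an assumption.

```python
import json

# Assumption: "approximately 5 MB" interpreted as 5 * 1024 * 1024 bytes.
MAX_PAYLOAD_BYTES = 5 * 1024 * 1024


def payload_size_bytes(payload):
    """Size of the payload once serialized to UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_in_one_request(payload, limit=MAX_PAYLOAD_BYTES):
    """True if the serialized payload stays within the invocation size limit."""
    return payload_size_bytes(payload) <= limit
```

A corpus that fails this check is a candidate for the training-job approach, where the corpus embeddings are stored with the model instead of being sent on every request.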
In the following example, we fetch and prepare the `Amazon_SageMaker_FAQs` dataset to use it in finding the nearest neighbor to an input question:
The algorithm-specific training hyperparameters can be fetched or overwritten with the SageMaker SDK:
The SageMaker training consists of two steps: create the estimator object and launch the training job. The output is a model prepackaged with embeddings of your large dataset used as training data, which can be deployed for inference to get the nearest neighbor for any input sentence. See the following code:
The query syntax to convert text into embeddings is the same as before. The code to get the nearest neighbor, however, can be simplified as follows:
We can also query the endpoint with questions in the `Amazon_SageMaker_FAQs` dataset and compare how many of the correct corresponding answers are returned. In the following example, we measure the top-3 accuracy, given that there could be similar question-answer pairs. This means that if the correct answer is returned as one of the top-3 results, it’s treated as a correct query.
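The top-3 metric defined above can be sketched as a small function. The name `top_k_accuracy` and its inputs are hypothetical: `expected_ids` holds the correct answer ID for each question, and `ranked_ids` holds the IDs returned for that question, best match first.

```python
def top_k_accuracy(expected_ids, ranked_ids, k=3):
    """Fraction of queries whose expected ID appears among the first k results."""
    if len(expected_ids) != len(ranked_ids):
        raise ValueError("expected_ids and ranked_ids must have the same length")
    if not expected_ids:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for expected, ranked in zip(expected_ids, ranked_ids) if expected in ranked[:k]
    )
    return hits / len(expected_ids)
```

With `k=1` the same function gives strict accuracy, which is a useful comparison when many question-answer pairs are near duplicates.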
Run a batch transform to get embeddings on large datasets
Enterprises and organizations with a large volume of historical documents that exceed the memory of a single endpoint instance can use SageMaker batch transform to save cost. When you start a batch transform job, SageMaker launches the necessary compute resources to process the data. During the job, SageMaker automatically provisions and manages the compute resources. When the batch transform job is complete, those resources are automatically cleaned up, which minimizes costs. By dividing a large dataset into smaller chunks and using more instances, you can scale out the compute for faster inference at a similar cost, without managing infrastructure. The maximum payload for batch transform is 100 MB, and the timeout is 1 hour.
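Dividing a dataset into smaller chunks can be sketched as follows. This is one simple way to prepare the split, not the post's own preprocessing; `split_into_chunks` is a hypothetical helper, and each chunk would then be written to its own input file.

```python
import math


def split_into_chunks(records, num_chunks):
    """Split records into at most num_chunks contiguous, near-equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if not records:
        return []
    # Round up so that every record lands in some chunk.
    size = math.ceil(len(records) / num_chunks)
    return [records[i:i + size] for i in range(0, len(records), size)]
```

The split is contiguous, so record order is preserved within and across chunks, which keeps it easy to line results back up with the source documents.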
The input format for our batch transform job is a JSONL file, with each entry a line of JSON consisting of `id` and `text_inputs`. See the following code:
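A minimal sketch of producing that format follows. The helper `to_jsonl` is hypothetical, and using the list position as an integer `id` is an assumption; only the two field names come from this post.

```python
import json


def to_jsonl(texts):
    """Serialize texts as JSON Lines, one {"id", "text_inputs"} object per line."""
    # Assumption: the position in the list is used as the integer id.
    return "".join(
        json.dumps({"id": index, "text_inputs": text}) + "\n"
        for index, text in enumerate(texts)
    )


jsonl_text = to_jsonl(["What is SageMaker?", "How do I deploy a model?"])
```

Because `json.dumps` escapes newline characters inside strings, each document stays on a single line even if its text spans several lines.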
When the data is ready in Amazon Simple Storage Service (Amazon S3), you can create the batch transform object from the SageMaker JumpStart model, which triggers the transform job:
After the batch transform job is complete, you can download the result from Amazon S3:
Conclusion
SageMaker JumpStart provides a straightforward way to use state-of-the-art large language foundation models for text embedding and semantic search. With the user interface or just a few lines of code, you can deploy a highly accurate text embedding model and find semantic matches across large datasets, at scale and cost-efficiently. SageMaker JumpStart removes the barriers to implementing semantic search by providing instant access to cutting-edge models like the ones benchmarked on the MTEB leaderboard. Businesses and developers can build intelligent search and recommendation systems faster.
This post demonstrated how to find semantically similar questions and answers, which could be applied to RAG use cases, recommendations and personalization, multilingual translations, and more. With continued advances in language models and the simplicity of SageMaker JumpStart, more organizations can infuse generative AI capabilities into their products. As the next step, you can try text-embedding models from SageMaker JumpStart on your own dataset to test and benchmark the results for your RAG use cases.
About the Authors
Dr. Baichuan Sun, currently serving as a Sr. AI/ML Solution Architect at AWS, focuses on generative AI and applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends, reflecting a commitment to both his professional growth and personal well-being.
Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He received his master's degree from the Courant Institute of Mathematical Sciences and his B.Tech from IIT Delhi. He has experience working on a diverse range of machine learning problems within the domains of natural language processing, computer vision, and time series analysis.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He received his PhD from the University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.
Author: Baichuan Sun