A practical guide to Amazon Nova Multimodal Embeddings

Embedding models power many modern applications—from semantic search and Retrieval-Augmented Generation (RAG) to recommendation systems and content understanding. However, selecting an embedding model requires careful consideration—after you’ve ingested your data, migrating to a different model means re-embedding your entire corpus, rebuilding vector indexes, and validating search quality from scratch. The right embedding model should deliver strong baseline performance, adapt to your specific use case, and support the modalities you need now and in the future.

The Amazon Nova Multimodal Embeddings model generates embeddings tailored to your specific use case—from single-modality text or image search to complex multimodal applications spanning documents, videos, and mixed content.

In this post, you will learn how to use Amazon Nova Multimodal Embeddings for your specific use cases:

  • Simplify your architecture with cross-modal search and visual document retrieval
  • Optimize performance by selecting embedding parameters matched to your workload
  • Implement common patterns through solution walkthroughs for media search, ecommerce discovery, and intelligent document retrieval

This guide provides a practical foundation to configure Amazon Nova Multimodal Embeddings for media asset search systems, product discovery experiences, and document retrieval applications.

Multimodal business use cases

You can use Amazon Nova Multimodal Embeddings across multiple business scenarios. The following table provides typical use cases and query examples:

| Modality | Content type | Use cases | Typical query examples |
|---|---|---|---|
| Video retrieval | Short video search | Asset library and media management | “Children opening Christmas presents,” “Blue whale breaching the ocean surface” |
| | Long video segment search | Film and entertainment, broadcast media, security surveillance | “Specific scene in a movie,” “Specific footage in news,” “Specific behavior in surveillance” |
| | Duplicate content identification | Media content management | Similar or duplicate video identification |
| Image retrieval | Thematic image search | Asset library, storage, and media management | “Red car with sunroof driving along the coast” |
| | Image reference search | E-commerce, design | “Shoes similar to this” + <image> |
| | Reverse image search | Content management | Find similar content based on uploaded image |
| Document retrieval | Specific information pages | Financial services, marketing markups, advertising brochures | Text information, data tables, chart pages |
| | Cross-page comprehensive information | Knowledge retrieval enhancement | Comprehensive information extraction from multi-page text, charts, and tables |
| Text retrieval | Thematic information retrieval | Knowledge retrieval enhancement | “Next steps in reactor decommissioning procedures” |
| | Text similarity analysis | Media content management | Duplicate headline detection |
| | Automatic topic clustering | Finance, healthcare | Symptom classification and summarization |
| | Contextual association retrieval | Finance, legal, insurance | “Maximum claim amount for corporate inspection accident violations” |
| Audio and voice retrieval | Audio retrieval | Asset library and media asset management | “Christmas music ringtone,” “Natural tranquil sound effects” |
| | Long audio segment search | Podcasts, meeting recordings | “Podcast host discussing neuroscience and sleep’s impact on brain health” |

Optimize performance for specific use cases

The Amazon Nova Multimodal Embeddings model optimizes its performance for specific use cases through the embeddingPurpose parameter, which supports two families of vectorization strategies: retrieval system mode and ML task mode (a minimal invocation sketch follows the list below).

  • Retrieval system mode (including GENERIC_INDEX and various *_RETRIEVAL parameters) targets information retrieval scenarios, distinguishing between two asymmetric phases: storage/INDEX and query/RETRIEVAL. See the following table for retrieval system categories and parameter selection.

| Phase | Parameter selection | Reason |
|---|---|---|
| Storage phase (all types) | GENERIC_INDEX | Optimized for indexing and storage |
| Query phase (mixed-modal repository) | GENERIC_RETRIEVAL | Search in mixed content |
| Query phase (text-only repository) | TEXT_RETRIEVAL | Search in text-only content |
| Query phase (image-only repository) | IMAGE_RETRIEVAL | Search in images (photos, illustrations, and so on) |
| Query phase (document image-only repository) | DOCUMENT_RETRIEVAL | Search in document images (scans, PDF screenshots, and so on) |
| Query phase (video-only repository) | VIDEO_RETRIEVAL | Search in videos |
| Query phase (audio-only repository) | AUDIO_RETRIEVAL | Search in audio |
  • ML task mode (including CLASSIFICATION and CLUSTERING parameters) targets machine learning scenarios, letting the model adapt to different types of downstream task requirements:
      ◦ CLASSIFICATION: Generated vectors are better suited to separating classification boundaries, facilitating downstream classifier training or direct classification.
      ◦ CLUSTERING: Generated vectors are better suited to forming cluster centers, facilitating downstream clustering algorithms.
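
To make these parameters concrete, here is a minimal Python sketch of requesting a purpose-optimized text embedding through the Amazon Bedrock Runtime invoke_model API. The model ID, request body shape, and response shape are assumptions modeled on the parameters described in this post, not a verified schema; check the Amazon Bedrock documentation for the exact field names.

```python
import json

import boto3

# Assumed model ID; verify it against the Amazon Bedrock model catalog.
MODEL_ID = "amazon.nova-2-multimodal-embeddings-v1:0"

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")


def embed_text(text: str, purpose: str, dimension: int = 1024) -> list[float]:
    """Request an embedding optimized for `purpose` (GENERIC_INDEX for the
    storage phase, a *_RETRIEVAL value for the query phase). The body below
    is an illustrative schema assembled from the parameters in this post."""
    body = {
        "taskType": "SINGLE_EMBEDDING",  # assumed task type name
        "singleEmbeddingParams": {       # assumed wrapper field
            "embeddingPurpose": purpose,
            "embeddingDimension": dimension,
            "text": {"truncationMode": "END", "value": text},
        },
    }
    response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    result = json.loads(response["body"].read())
    return result["embeddings"][0]["embedding"]  # assumed response shape


# Storage phase uses GENERIC_INDEX; the query phase picks the *_RETRIEVAL
# purpose that matches the repository, for example:
query_vector = embed_text("red car with sunroof driving along the coast",
                          purpose="IMAGE_RETRIEVAL")
```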

Walkthrough of building a multimodal search and retrieval solution

Amazon Nova Multimodal Embeddings is purpose-built for multimodal search and retrieval, which is the foundation of multimodal agentic RAG systems. The following diagrams show how to build a multimodal search and retrieval solution.

RAG solution with Amazon Nova Multimodal Embeddings

In a multimodal search and retrieval solution, shown in the preceding diagram, raw content—including text, images, audio, and video—is initially transformed into vector representations through an embedding model to encapsulate semantic features. Subsequently, these vectors are stored in a vector database. User queries are similarly converted into query vectors within the same vector space. The retrieval of the top K most relevant items is achieved by calculating the similarity between the query vector and the indexed vectors. This multimodal search and retrieval solution can be encapsulated as a Model Context Protocol (MCP) tool, thereby facilitating access within a multimodal agentic RAG solution, shown in the following diagram.
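
As a sketch of this flow under stated assumptions, the following Python example creates an OpenSearch k-NN index, ingests one embedded asset, and retrieves the top K neighbors for a query. The endpoint, index name, and field names are illustrative, and embed_text is the hypothetical helper from the earlier sketch.

```python
from opensearchpy import OpenSearch

# Illustrative endpoint; point this at your OpenSearch domain.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# A k-NN index whose vector field matches the embedding dimension (1024 here).
client.indices.create(
    index="media-assets",
    body={
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 1024},
                "title": {"type": "text"},
            }
        },
    },
)

# Ingestion: embed each asset with the storage purpose and store its metadata.
client.index(
    index="media-assets",
    body={
        "embedding": embed_text("whale documentary clip", purpose="GENERIC_INDEX"),
        "title": "whale documentary clip",
    },
)

# Runtime: embed the query with a retrieval purpose and fetch the top K items.
query_vector = embed_text("blue whale breaching the ocean surface",
                          purpose="VIDEO_RETRIEVAL")
results = client.search(
    index="media-assets",
    body={"size": 5, "query": {"knn": {"embedding": {"vector": query_vector, "k": 5}}}},
)
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```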

Agentic RAG solution with Amazon Nova Multimodal Embeddings

The multimodal search and retrieval solution can be divided into two distinct data flows:

  1. Data ingestion
  2. Runtime search and retrieval

The following table lists the common modules within each data flow, along with the associated tools and technologies:

| Data flow | Module | Description | Common tools and technologies |
|---|---|---|---|
| Data ingestion | Generate embeddings | Convert inputs (text, images, audio, video, and so on) into vector representations | Embedding model |
| | Store embeddings in vector stores | Store generated vectors in a vector database or storage structure for subsequent retrieval | Popular vector databases |
| Runtime search and retrieval | Similarity retrieval algorithm | Calculate the similarity or distance between query vectors and indexed vectors, and retrieve the closest items | Common distances: cosine similarity, inner product, Euclidean distance; database support for k-NN and ANN, such as Amazon OpenSearch k-NN |
| | Top K retrieval and voting mechanism | Select the top K nearest neighbors from retrieval results, then optionally combine multiple strategies (voting, reranking, fusion) | For example, top K nearest neighbors, or fusion of keyword retrieval and vector retrieval (hybrid search) |
| | Integration strategy and hybrid retrieval | Combine multiple retrieval mechanisms or modal results, such as keyword and vector retrieval, or text and image retrieval fusion | Hybrid search (such as Amazon OpenSearch hybrid search) |
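
For the hybrid retrieval row above, the following sketch shows one way to fuse keyword and vector retrieval with an OpenSearch hybrid query. It assumes a search pipeline (here named norm-pipeline) with a normalization processor has already been created to combine the two score distributions; the index, fields, and query_vector are the illustrative ones from the previous sketch.

```python
# Hybrid search: a lexical match clause and a k-NN clause scored together.
# Assumes a search pipeline "norm-pipeline" with a normalization-processor
# was created beforehand (PUT /_search/pipeline/norm-pipeline).
hybrid_query = {
    "size": 5,
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"title": {"query": "blue whale"}}},
                {"knn": {"embedding": {"vector": query_vector, "k": 5}}},
            ]
        }
    },
}
results = client.search(
    index="media-assets",
    body=hybrid_query,
    params={"search_pipeline": "norm-pipeline"},
)
```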

We will explore several cross-modal business use cases and provide a high-level overview of how to address them using Amazon Nova Multimodal Embeddings.

Use case: Product retrieval and classification

E-commerce applications require the capability to automatically classify product images and identify similar items without the need for manual tagging. The following diagram illustrates a high-level solution:

Product categorization with Amazon Nova Multimodal Embeddings

  1. Convert product images to embeddings using Amazon Nova Multimodal Embeddings
  2. Store embeddings and labels as metadata in a vector database
  3. Query new product images and find the top K similar products
  4. Use a voting mechanism on retrieved results to predict category

Key embedding parameters:

| Parameter | Value | Purpose |
|---|---|---|
| embeddingPurpose | GENERIC_INDEX (indexing) and IMAGE_RETRIEVAL (querying) | Optimizes for product image retrieval |
| embeddingDimension | 1024 | Balances accuracy and performance |
| detailLevel | STANDARD_IMAGE | Suitable for product photos |
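
Here is a compact sketch of steps 3 and 4 under the same assumptions as the earlier examples: embed_image is a hypothetical helper mirroring embed_text but with an assumed image payload, the "products" index and its "category" field are illustrative, and the voting step takes a simple majority over the categories of the top K neighbors.

```python
import base64
import collections
import json


def embed_image(image_path: str, purpose: str, dimension: int = 1024) -> list[float]:
    """Embed a product image; the image payload shape is an assumption."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    body = {
        "taskType": "SINGLE_EMBEDDING",  # assumed, as in the text sketch
        "singleEmbeddingParams": {
            "embeddingPurpose": purpose,
            "embeddingDimension": dimension,
            "detailLevel": "STANDARD_IMAGE",
            "image": {"format": "jpeg", "source": {"bytes": image_b64}},  # assumed
        },
    }
    response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    return json.loads(response["body"].read())["embeddings"][0]["embedding"]


def classify_product(image_path: str, k: int = 10) -> str:
    """Predict a category by majority vote over the top K similar products."""
    vector = embed_image(image_path, purpose="IMAGE_RETRIEVAL")
    hits = client.search(
        index="products",  # illustrative index storing a "category" field
        body={"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}},
    )
    labels = [hit["_source"]["category"] for hit in hits["hits"]["hits"]]
    return collections.Counter(labels).most_common(1)[0][0]
```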

Use case: Intelligent document retrieval

Financial analysts, legal teams, and researchers need to quickly find specific information (tables, charts, clauses) across complex multi-page documents without manual review. The following diagram illustrates a high-level solution:

Generate graphic document embeddings with Amazon Nova Multimodal Embeddings

  1. Convert each PDF page to a high-resolution image
  2. Generate embeddings for all document pages
  3. Store embeddings in a vector database
  4. Accept natural language queries and convert to embeddings
  5. Retrieve the top K most relevant pages based on semantic similarity
  6. Return pages with financial tables, charts, or specific content

Key embedding parameters:

| Parameter | Value | Purpose |
|---|---|---|
| embeddingPurpose | GENERIC_INDEX (indexing) and DOCUMENT_RETRIEVAL (querying) | Optimizes for document content understanding |
| embeddingDimension | 3072 | Highest precision for complex document structures |
| detailLevel | DOCUMENT_IMAGE | Preserves tables, charts, and text layout |
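
The following sketch covers steps 1 and 2, rendering each page with the pdf2image library (one common option among several) and embedding it with the document-oriented parameters from the table; the request body is the same assumed schema used in the earlier sketches.

```python
import base64
import io
import json

from pdf2image import convert_from_path  # one common PDF-to-image option


def embed_page(page_image, dimension: int = 3072) -> list[float]:
    """Embed one rendered PDF page with document-oriented parameters.
    The body shape is an illustrative assumption, as in earlier sketches."""
    buffer = io.BytesIO()
    page_image.save(buffer, format="PNG")
    body = {
        "taskType": "SINGLE_EMBEDDING",
        "singleEmbeddingParams": {
            "embeddingPurpose": "GENERIC_INDEX",
            "embeddingDimension": dimension,
            "detailLevel": "DOCUMENT_IMAGE",  # preserves tables, charts, layout
            "image": {
                "format": "png",
                "source": {"bytes": base64.b64encode(buffer.getvalue()).decode("utf-8")},
            },
        },
    }
    response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    return json.loads(response["body"].read())["embeddings"][0]["embedding"]


# Render at a high DPI so tables and charts remain legible, then store each
# page vector with its page number as metadata in the vector database.
for page_number, page in enumerate(convert_from_path("report.pdf", dpi=200), start=1):
    vector = embed_page(page)
```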

When dealing with text-based documents that lack visual elements, it’s recommended to extract the text content, apply a chunking strategy, and use GENERIC_INDEX for indexing and TEXT_RETRIEVAL for querying.
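
As a minimal illustration of that recommendation, here is a naive fixed-size chunker; production systems often split on sentence or section boundaries instead, and the sizes shown are arbitrary starting points.

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size chunks (a deliberately
    simple strategy; tune sizes and boundaries for your corpus)."""
    chunks = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks


extracted_document_text = "..."  # placeholder: text extracted from your document

# Index each chunk with embeddingPurpose="GENERIC_INDEX" and embed queries
# with "TEXT_RETRIEVAL", using the embed_text helper sketched earlier.
for chunk in chunk_text(extracted_document_text):
    vector = embed_text(chunk, purpose="GENERIC_INDEX")
```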

Use case: Video clips search

Media applications require efficient methods to locate specific video clips from extensive video libraries using natural language descriptions. By converting videos and text queries into embeddings within a unified semantic space, similarity matching can be used to retrieve relevant video segments. The following diagram illustrates a high-level solution:

Video clip search with Amazon Nova Multimodal Embeddings

  1. Generate embeddings with Amazon Nova Multimodal Embeddings using the invoke_model API for short videos or the start_async_invoke API for long videos with segmentation
  2. Store embeddings in a vector database
  3. Accept natural language queries and convert to embeddings
  4. Retrieve the top K video clips from the vector database for review or further editing

Key embedding parameters:

| Parameter | Value | Purpose |
|---|---|---|
| embeddingPurpose | GENERIC_INDEX (indexing) and VIDEO_RETRIEVAL (querying) | Optimizes for video indexing and retrieval |
| embeddingDimension | 1024 | Balances precision and cost |
| embeddingMode | AUDIO_VIDEO_COMBINED | Fuses visual and audio content |
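
For the long-video path in step 1, the following sketch uses the Bedrock Runtime start_async_invoke and get_async_invoke APIs, which write results to Amazon S3. The modelInput field names, including the segmentation settings, are assumptions modeled on the parameters in this post; verify them against the Amazon Bedrock documentation.

```python
import time

# Asynchronous invocation for long videos; results land in the S3 prefix below.
model_input = {
    "taskType": "SEGMENTED_EMBEDDING",   # assumed task type for segmented media
    "segmentedEmbeddingParams": {        # assumed wrapper and field names
        "embeddingPurpose": "GENERIC_INDEX",
        "embeddingDimension": 1024,
        "embeddingMode": "AUDIO_VIDEO_COMBINED",
        "video": {
            "format": "mp4",
            "source": {"s3Location": {"uri": "s3://my-bucket/footage.mp4"}},
        },
        "segmentationConfig": {"durationSeconds": 15},  # assumed segment control
    },
}

response = bedrock_runtime.start_async_invoke(
    modelId=MODEL_ID,
    modelInput=model_input,
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/embeddings/"}},
)
invocation_arn = response["invocationArn"]

# Poll until the job finishes, then read the per-segment vectors from S3.
while True:
    job = bedrock_runtime.get_async_invoke(invocationArn=invocation_arn)
    if job["status"] != "InProgress":
        break
    time.sleep(30)
```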

Use case: Audio fingerprinting

Music applications and copyright management systems need to identify duplicate or similar audio content, and match audio segments to source tracks for copyright detection and content recognition. The following diagram illustrates a high-level solution:

Audio fingerprinting with Amazon Nova Multimodal Embeddings

  1. Convert audio files to embeddings using Amazon Nova Multimodal Embeddings
  2. Store embeddings in a vector database with genre and other metadata
  3. Query with audio segments and find the top K similar tracks
  4. Compare similarity scores to identify source matches and detect duplicates

Key embedding parameters:

| Parameter | Value | Purpose |
|---|---|---|
| embeddingPurpose | GENERIC_INDEX (indexing) and AUDIO_RETRIEVAL (querying) | Optimizes for audio fingerprinting and matching |
| embeddingDimension | 1024 | Balances accuracy and performance for audio similarity |
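
To illustrate step 4, here is a small similarity check over candidate tracks returned by a top K vector search; the 0.95 threshold is illustrative and should be tuned on labeled duplicate pairs from your own catalog.

```python
import numpy as np


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


DUPLICATE_THRESHOLD = 0.95  # illustrative; tune on labeled duplicate pairs


def find_duplicates(query_vector, candidates):
    """candidates: (track_id, vector) pairs from a top K vector search.
    Returns tracks whose similarity clears the duplicate threshold."""
    matches = []
    for track_id, vector in candidates:
        score = cosine_similarity(query_vector, vector)
        if score >= DUPLICATE_THRESHOLD:
            matches.append((track_id, score))
    return sorted(matches, key=lambda m: m[1], reverse=True)
```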

Conclusion

You can use Amazon Nova Multimodal Embeddings to work with diverse data types within a unified semantic space. By supporting text, images, documents, video, and audio through flexible, purpose-optimized embedding API parameters, you can build more effective retrieval systems, classification pipelines, and semantic search applications. Whether you’re implementing cross-modal search, document intelligence, or product classification, Amazon Nova Multimodal Embeddings provides the foundation to extract insights from unstructured data at scale. To get started, read the launch post, Amazon Nova Multimodal Embeddings: State-of-the-art embedding model for agentic RAG and semantic search, and explore the GitHub samples to integrate Amazon Nova Multimodal Embeddings into your applications today.


About the authors

Yunyi Gao is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS), responsible for consulting on the design of AWS AI/ML and generative AI solutions and architectures.

Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.
