Enhance speech synthesis and video generation models with RLHF using audio and video segmentation in Amazon SageMaker

As generative AI models advance in creating multimedia content, the difference between good and great output often lies in the details that only human feedback can capture. Audio and video segmentation provides a structured way to gather this detailed feedback, allowing models to learn through reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT). Annotators can precisely mark and evaluate specific moments in audio or video content, helping models understand what makes content feel authentic to human viewers and listeners.

Take, for instance, text-to-video generation, where models need to learn not just what to generate but how to maintain consistency and natural flow across time. When creating a scene of a person performing a sequence of actions, factors like the timing of movements, visual consistency, and smoothness of transitions contribute to the quality. Through precise segmentation and annotation, human annotators can provide detailed feedback on each of these aspects, helping models learn what makes a generated video sequence feel natural rather than artificial. Similarly, in text-to-speech applications, understanding the subtle nuances of human speech—from the length of pauses between phrases to changes in emotional tone—requires detailed human feedback at a segment level. This granular input helps models learn how to produce speech that sounds natural, with appropriate pacing and emotional consistency. As large language models (LLMs) increasingly integrate more multimedia capabilities, human feedback becomes even more critical in training them to generate rich, multi-modal content that aligns with human quality standards.

The path to creating effective AI models for audio and video generation presents several distinct challenges. Annotators need to identify precise moments where generated content matches or deviates from natural human expectations. For speech generation, this means marking exact points where intonation changes, where pauses feel unnatural, or where emotional tone shifts unexpectedly. In video generation, annotators must pinpoint frames where motion becomes jerky, where object consistency breaks, or where lighting changes appear artificial. Traditional annotation tools, with basic playback and marking capabilities, often fall short in capturing these nuanced details.

Amazon SageMaker Ground Truth enables RLHF by allowing teams to integrate detailed human feedback directly into model training. Through custom human annotation workflows, organizations can equip annotators with tools for high-precision segmentation. This setup enables the model to learn from human-labeled data, refining its ability to produce content that aligns with natural human expectations.

In this post, we show you how to implement an audio and video segmentation solution in the accompanying GitHub repository using SageMaker Ground Truth. We guide you through deploying the necessary infrastructure using AWS CloudFormation, creating an internal labeling workforce, and setting up your first labeling job. We demonstrate how to use Wavesurfer.js for precise audio visualization and segmentation, configure both segment-level and full-content annotations, and build the interface for your specific needs. We cover both console-based and programmatic approaches to creating labeling jobs, and provide guidance on extending the solution with your own annotation needs. By the end of this post, you will have a fully functional audio/video segmentation workflow that you can adapt for various use cases, from training speech synthesis models to improving video generation capabilities.

Feature overview

The integration of Wavesurfer.js in our UI provides a detailed waveform visualization where annotators can instantly see patterns in speech, silence, and audio intensity. For instance, when working on speech synthesis, annotators can visually identify unnatural gaps between words or abrupt changes in volume that might make generated speech sound robotic. The ability to zoom into these waveform patterns means they can work with millisecond precision—marking exactly where a pause is too long or where an emotional transition happens too abruptly.

In this snapshot of audio segmentation, we are capturing a customer-representative conversation, annotating speaker segments, emotions, and transcribing the dialogue. The UI allows for playback speed adjustment and zoom functionality for precise audio analysis.

The multi-track feature lets annotators create separate tracks for evaluating different aspects of the content. In a text-to-speech task, one track might focus on pronunciation accuracy, another on emotional consistency, and a third on natural pacing. For video generation tasks, annotators can mark segments where motion flows naturally, where object consistency is maintained, and where scene transitions work well. They can adjust playback speed to catch subtle details and use the visual timeline to set precise start and end points for each marked segment.

In this snapshot of video segmentation, we’re annotating a scene with dogs, tracking individual animals, their colors, emotions, and gaits. The UI also enables overall video quality assessment, scene change detection, and object presence classification.

Annotation process

Annotators begin by choosing Add New Track and selecting appropriate categories and tags for their annotation task. After creating the track, they choose Begin Recording at the point where a segment should start. As the content plays, they monitor the audio waveform or video frames until they reach the desired end point, then choose Stop Recording. The newly created segment appears in the right pane, where they can add classifications, transcriptions, or other relevant labels. This process can be repeated for as many segments as needed, and annotators can adjust segment boundaries, delete incorrect segments, or create new tracks for different annotation purposes.

Importance of high-quality data and reducing labeling errors

High-quality data is essential for training generative AI models that can produce natural, human-like audio and video content. The performance of these models depends directly on the accuracy and detail of human feedback, which stems from the precision and completeness of the annotation process. For audio and video content, this means capturing not just what sounds or looks unnatural, but exactly when and how these issues occur.

Our purpose-built UI in SageMaker Ground Truth addresses common challenges in audio and video annotation that often lead to inconsistent or imprecise feedback. When annotators work with long audio or video files, they need to mark precise moments where generated content deviates from natural human expectations. For example, in speech generation, an unnatural pause might last only a fraction of a second, but its impact on perceived quality is significant. The tool’s zoom functionality allows annotators to expand these brief moments across their screen, making it possible to mark the exact start and end points of these subtle issues. This precision helps models learn the fine details that separate natural from artificial-sounding speech.

Solution overview

This audio/video segmentation solution combines several AWS services to create a robust annotation workflow. At its core, Amazon Simple Storage Service (Amazon S3) serves as the secure storage for input files, manifest files, annotation outputs, and the web UI components. SageMaker Ground Truth provides annotators with a web portal to access their labeling jobs and manages the overall annotation workflow. The following diagram illustrates the solution architecture.

The UI template, which includes our specialized audio/video segmentation interface built with Wavesurfer.js, requires specific JavaScript and CSS files. These files are hosted through an Amazon CloudFront distribution, which provides reliable and efficient delivery to annotators’ browsers. Using CloudFront with an origin access identity (OAI) and appropriate bucket policies, we serve the UI components to annotators without making the S3 bucket directly public. This setup follows AWS best practices for least-privilege access, making sure CloudFront can only access the specific UI files needed for the annotation interface.

Pre-annotation and post-annotation AWS Lambda functions are optional components that can enhance the workflow. The pre-annotation Lambda function can process the input manifest file before data is presented to annotators, enabling any necessary formatting or modifications. Similarly, the post-annotation Lambda function can transform the annotation outputs into specific formats required for model training. These functions provide flexibility to adapt the workflow to specific needs without requiring changes to the core annotation process.
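
As an illustration of the optional pre-annotation step, the following is a minimal sketch of a Lambda handler that reshapes one manifest line before it reaches the UI template. It assumes the standard Ground Truth pre-annotation contract (the event carries the parsed manifest line under dataObject, and the function returns taskInput and isHumanAnnotationRequired). The callId output name and the whitespace trimming are our own illustrative choices, not part of the solution's code.

```python
def lambda_handler(event, context):
    """Pre-annotation hook: turn one manifest line into the task input."""
    # Ground Truth passes the parsed manifest line as event["dataObject"].
    data_object = event["dataObject"]

    task_input = {
        "source": data_object["source"],
        # Optional metadata; the input keys mirror the sample manifest used
        # in this post, the output names are hypothetical.
        "callId": data_object.get("call-id", ""),
        "transcription": data_object.get("transcription", "").strip(),
    }

    return {
        "taskInput": task_input,
        # Ground Truth expects this flag as a string, not a boolean.
        "isHumanAnnotationRequired": "true",
    }
```

Whatever keys you place in taskInput are the ones your UI template has to read, so keep the two in sync if you rename fields.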

The solution uses AWS Identity and Access Management (IAM) roles to manage permissions:

A SageMaker Ground Truth IAM role enables access to Amazon S3 for reading input files and writing annotation outputs
If used, Lambda function roles provide the necessary permissions for preprocessing and postprocessing tasks

Let’s walk through the process of setting up your annotation workflow. We start with a simple scenario: you have an audio file stored in Amazon S3, along with some metadata like a call ID and its transcription. By the end of this walkthrough, you will have a fully functional annotation system where your team can segment and classify this audio content.

Prerequisites

For this walkthrough, make sure you have the following:

Familiarity with SageMaker Ground Truth labeling jobs and the workforce portal
Basic understanding of CloudFormation templates
An AWS account with permissions to deploy CloudFormation stacks
A SageMaker Ground Truth private workforce configured for labeling jobs
Permissions to launch CloudFormation stacks that create and configure S3 buckets, CloudFront distributions, and Lambda functions automatically

Create your internal workforce

Before we dive into the technical setup, let’s create a private workforce in SageMaker Ground Truth. This allows you to test the annotation workflow with your internal team before scaling to a larger operation.

On the SageMaker console, choose Labeling workforces.
Choose Private for the workforce type and create a new private team.
Add team members using their email addresses—they will receive instructions to set up their accounts.

Deploy the infrastructure

Although this post demonstrates using a CloudFormation template for quick deployment, you can also set up the components manually. The assets (JavaScript and CSS files) are available in our GitHub repository. Complete the following steps for manual deployment:

Download these assets directly from the GitHub repository.
Host them in your own S3 bucket.
Set up your own CloudFront distribution to serve these files.
Configure the necessary permissions and CORS settings.

This manual approach gives you more control over infrastructure setup and might be preferred if you have existing CloudFront distributions or a need to customize security controls and assets.
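
For the CORS step of the manual path, the following sketch builds a CORS configuration in the shape that the Amazon S3 put_bucket_cors API accepts. The rule shown (GET requests, exposing the Access-Control-Allow-Origin header) follows the general Ground Truth guidance for S3 buckets; the wildcard origin and both function names are assumptions made for illustration, so tighten them to match your own security requirements.

```python
def build_cors_configuration(allowed_origins=("*",)):
    """Return a CORSConfiguration dict for S3's put_bucket_cors call."""
    return {
        "CORSRules": [
            {
                "AllowedMethods": ["GET"],
                "AllowedOrigins": list(allowed_origins),
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
            }
        ]
    }


def apply_cors(s3_client, bucket_name, allowed_origins=("*",)):
    """Apply the configuration using a boto3 S3 client supplied by the caller."""
    s3_client.put_bucket_cors(
        Bucket=bucket_name,
        CORSConfiguration=build_cors_configuration(allowed_origins),
    )
```

You would call apply_cors(boto3.client("s3"), "your-ui-bucket") once after creating the bucket; the bucket name here is a placeholder.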

The rest of this post will focus on the CloudFormation deployment approach, but the labeling job configuration steps remain the same regardless of how you choose to host the UI assets.

This CloudFormation template creates and configures the following AWS resources:

S3 bucket for UI components:
- Stores the UI JavaScript and CSS files
- Configured with CORS settings required for SageMaker Ground Truth
- Accessible only through CloudFront, not directly public
- Permissions are set using a bucket policy that grants read access only to the CloudFront Origin Access Identity (OAI)
CloudFront distribution:
- Provides secure and efficient delivery of UI components
- Uses an OAI to securely access the S3 bucket
- Is configured with appropriate cache settings for optimal performance
- Access logging is enabled, with logs being stored in a dedicated S3 bucket
S3 bucket for CloudFront logs:
- Stores access logs generated by CloudFront
- Is configured with the required bucket policies and ACLs to allow CloudFront to write logs
- Object ownership is set to ObjectWriter to enable ACL usage for CloudFront logging
- Lifecycle configuration is set to automatically delete logs older than 90 days to manage storage
Lambda function:
- Downloads UI files from our GitHub repository
- Stores them in the S3 bucket for UI components
- Runs only during initial setup and uses least privilege permissions
- Permissions include Amazon CloudWatch Logs for monitoring and specific S3 actions (read/write) limited to the created bucket

After the CloudFormation stack deployment is complete, you can find the CloudFront URLs for the JavaScript and CSS files in the stack outputs on the AWS CloudFormation console. Note these values; you need them to update your UI template before creating the labeling job.
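
If you prefer to read these values programmatically rather than from the console, a small helper like the following can flatten the response of the CloudFormation DescribeStacks API into a dictionary. This is a sketch only: the function name is ours, and the actual output keys depend on the template, so check them in your own stack.

```python
def stack_outputs(describe_stacks_response):
    """Map OutputKey -> OutputValue for the first stack in the response."""
    stack = describe_stacks_response["Stacks"][0]
    # A stack with no outputs omits the "Outputs" key entirely.
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stack.get("Outputs", [])
    }
```

With boto3, the input would come from boto3.client("cloudformation").describe_stacks(StackName="your-stack-name"), where the stack name is whatever you chose at deployment time.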

Prepare your input manifest

Before you create the labeling job, you need to prepare an input manifest file that tells SageMaker Ground Truth what data to present to annotators. The manifest structure is flexible and can be customized based on your needs. For this post, we use a simple structure:

{
"source": "s3://YOUR-BUCKET/audio/sample1.mp3",
"call-id": "call-123",
"transcription": "Customer: I'm really happy with your smart home security system. However, I have a feature request that would make it better.\nRepresentative: We're always eager to hear from our customers. What feature would you like to see added?"
}

You can adapt this structure to include additional metadata that your annotation workflow requires. For example, you might want to add speaker information, timestamps, or other contextual data. The key is making sure your UI template is designed to process and display these attributes appropriately.
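
The sample above is spread over several lines for readability, but a Ground Truth input manifest is a JSON Lines file: each data object occupies exactly one line and carries a source or source-ref key. The following sketch serializes a list of records in that form. The function name is hypothetical, and uploading the resulting text to Amazon S3 is left out.

```python
import json


def to_manifest(records):
    """Serialize records as JSON Lines, one data object per line."""
    lines = []
    for index, record in enumerate(records):
        if "source" not in record and "source-ref" not in record:
            raise ValueError(f"record {index} needs a 'source' or 'source-ref' key")
        # json.dumps escapes embedded newlines, so each object stays on one line.
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"
```

Because line breaks inside the transcription are escaped during serialization, multi-turn dialogue text does not split a record across manifest lines.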

Create your labeling job

With the infrastructure deployed, let’s create the labeling job in SageMaker Ground Truth. For full instructions, refer to Accelerate custom labeling workflows in Amazon SageMaker Ground Truth without using AWS Lambda.

On the SageMaker console, choose Create labeling job.
Give your job a name.
Specify your input data location in Amazon S3.
Specify an output bucket where annotations will be stored.
For the task type, select Custom labeling task.
In the UI template field, locate the placeholder values for the JavaScript and CSS files and update as follows:
1. Replace audiovideo-wavesufer.js with your CloudFront JavaScript URL from the CloudFormation stack outputs.
2. Replace audiovideo-stylesheet.css with your CloudFront CSS URL from the CloudFormation stack outputs.

<!-- Custom Javascript and Stylesheet -->
<script src="audiovideo-wavesufer.js"></script>
<link rel="stylesheet" href="audiovideo-stylesheet.css">

Before you launch the job, use the Preview feature to verify your interface.

You should see the Wavesurfer.js interface load correctly with all controls working properly. This preview step is crucial—it confirms that your CloudFront URLs are correctly specified and the interface is properly configured.
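
If you create jobs repeatedly, you might script the placeholder replacement instead of editing the template by hand. The following sketch substitutes the two placeholder file names shown earlier with your CloudFront URLs and fails loudly if a placeholder is missing. The function name and the example URLs are assumptions for illustration.

```python
PLACEHOLDERS = ("audiovideo-wavesufer.js", "audiovideo-stylesheet.css")


def inject_asset_urls(template_html, js_url, css_url):
    """Replace the JavaScript and CSS placeholders with CloudFront URLs."""
    for placeholder, url in zip(PLACEHOLDERS, (js_url, css_url)):
        if placeholder not in template_html:
            raise ValueError(f"placeholder not found in template: {placeholder}")
        template_html = template_html.replace(placeholder, url)
    return template_html
```

The error on a missing placeholder is deliberate: a template that silently keeps a relative file name would still upload, but the interface would not load in the preview.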

Programmatic setup

Alternatively, you can create your labeling job programmatically using the CreateLabelingJob API. This is particularly useful for automation or when you need to create multiple jobs. See the following code:

import boto3

# Assumes AWS credentials and a default Region are already configured.
sagemaker = boto3.client("sagemaker")

response = sagemaker.create_labeling_job(
    LabelingJobName="audio-segmentation-job-demo",
    LabelAttributeName="label",
    InputConfig={
        "DataSource": {
            "S3DataSource": {
                "ManifestS3Uri": "s3://your-bucket-name/path-to-manifest"
            }
        }
    },
    OutputConfig={
        "S3OutputPath": "s3://your-bucket-name/path-to-output-file"
    },
    RoleArn="arn:aws:iam::012345678910:role/SagemakerExecutionRole",

    # Optionally add PreHumanTaskLambdaArn or AnnotationConsolidationConfig
    HumanTaskConfig={
        "TaskAvailabilityLifetimeInSeconds": 21600,
        "TaskTimeLimitInSeconds": 3600,
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:012345678910:workteam/private-crowd/work-team-name",
        "TaskDescription": "Segment the audio clip and classify each segment.",
        "MaxConcurrentTaskCount": 1000,
        "TaskTitle": "Audio segmentation and classification",
        "NumberOfHumanWorkersPerDataObject": 1,
        "UiConfig": {
            "UiTemplateS3Uri": "s3://your-bucket-name/path-to-ui-template"
        }
    }
)

The API approach offers the same functionality as the SageMaker console, but allows for automation and integration with existing workflows. Whether you choose the SageMaker console or API approach, the result is the same: a fully configured labeling job ready for your annotation team.
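
After a job is created through the API, you typically want to know when it finishes. The following sketch polls the DescribeLabelingJob API until the job reaches a terminal status. The function name, polling interval, and poll limit are our own choices; the status values are the ones the API reports.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_labeling_job(sagemaker_client, job_name, poll_seconds=60, max_polls=120):
    """Poll until the labeling job finishes, then return its description."""
    for _ in range(max_polls):
        description = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)
        if description["LabelingJobStatus"] in TERMINAL_STATUSES:
            return description
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")
```

The returned description also includes LabelCounters, which you can inspect to see how many objects were labeled. Human labeling jobs can run for hours or days, so in practice a scheduled check is often a better fit than a long-running loop.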

Understanding the output

After your annotators complete their work, SageMaker Ground Truth will generate an output manifest in your specified S3 bucket. This manifest contains rich information at two levels:

Segment-level classifications – Details about each marked segment, including start and end times and assigned categories
Full-content classifications – Overall ratings and classifications for the entire file

Let’s look at a sample output to understand its structure:

{
  "answers": [
    {
      "acceptanceTime": "2024-11-04T18:33:38.658Z",
      "answerContent": {
        "annotations": {
          "categories": {
            "language": [
              "English",
              "Hindi",
              "Spanish",
              "French",
              "German",
              "Dutch"
            ],
            "speaker": [
              "Customer",
              "Representative"
            ]
          },
          "startTimestamp": 1730745219028,
          "startUTCTime": "Mon, 04 Nov 2024 18:33:39 GMT",
          "streams": {
            "language": [
              {
                "id": "English",
                "start": 0,
                "end": 334.808635,
                "text": "Sample text in English",
                "emotion": "happy"
              },
              {
                "id": "Spanish",
                "start": 334.808635,
                "end": 550.348471,
                "text": "Texto de ejemplo en español",
                "emotion": "neutral"
              }
            ]
          },
          "endTimestamp": 1730745269602,
          "endUTCTime": "Mon, 04 Nov 2024 18:34:29 GMT",
          "elapsedTime": 50574
        },
        "backgroundNoise": {
          "ambient": false,
          "music": true,
          "traffic": false
        },
        "emotiontag": "Neutral",
        "environmentalSounds": {
          "birdsChirping": false,
          "doorbell": true,
          "footsteps": false
        },
        "rate": {
          "1": false,
          "2": false,
          "3": false,
          "4": false,
          "5": true
        },
        "textTranslationFinal": "sample text for transcription"
      }
    }
  ]
}

This two-level annotation structure provides valuable training data for your AI models, capturing both fine-grained details and overall content assessment.
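
To turn this output into training data, you usually flatten it first. The following sketch reads one entry of the answers array and returns the segment-level annotations together with the overall rating. It assumes the field names shown in the sample output above (streams, rate, emotiontag); if you customize the template, your keys will differ. The function name is hypothetical.

```python
def summarize_answer(answer):
    """Flatten one worker answer into segments plus full-content labels."""
    content = answer["answerContent"]

    segments = []
    for track, items in content["annotations"]["streams"].items():
        for item in items:
            segments.append(
                {
                    "track": track,
                    "label": item["id"],
                    "start": item["start"],
                    "end": item["end"],
                    "text": item.get("text", ""),
                    "emotion": item.get("emotion"),
                }
            )

    # "rate" maps each score to a boolean; the selected score is the True one.
    rating = next(
        (int(score) for score, chosen in content.get("rate", {}).items() if chosen),
        None,
    )
    return {
        "segments": segments,
        "rating": rating,
        "emotion": content.get("emotiontag"),
    }
```

Applied to the sample above, this yields two segments on the language track and an overall rating of 5, which you could then pair with the source file to build reward-model or fine-tuning records.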

Customizing the solution

Our audio/video segmentation solution is designed to be highly customizable. Let’s walk through how you can adapt the interface to match your specific annotation requirements.

Customize segment-level annotations

The segment-level annotations are controlled in the report() function of the JavaScript code. The following code snippet shows how you can modify the annotation options for each segment:

ranges.forEach(function (r) {
   // ... existing code ...
   
   // Example: Adding a custom dropdown for speaker identification
   var speakerDropdown = $('<select>').attr({
       name: 'speaker',
       class: 'custom-dropdown-width'
   });
   var speakerOptions = ['Speaker A', 'Speaker B', 'Multiple Speakers', 'Background Noise'];
   speakerOptions.forEach(function(option) {
       speakerDropdown.append($('<option>').val(option).text(option));
   });
   
   // Example: Adding a checkbox for quality issues
   var qualityCheck = $('<input>').attr({
       type: 'checkbox',
       name: 'quality_issue'
   });
   var qualityLabel = $('<label>').text('Contains Quality Issues');

   tr.append($('<TD>').append(speakerDropdown));
   tr.append($('<TD>').append(qualityCheck).append(qualityLabel));
   
   // Add event listeners for your new fields
   speakerDropdown.on('change', function() {
       r.speaker = $(this).val();
       updateTrackListData(r);
   });
   
   qualityCheck.on('change', function() {
       r.hasQualityIssues = $(this).is(':checked');
       updateTrackListData(r);
   });
});

You can remove existing fields or add new ones based on your needs. Make sure you also update the data model (the updateTrackListData function) so that it handles your custom fields.

Modify full-content classifications

For classifications that apply to the entire audio/video file, you can modify the HTML template. The following code is an example of adding custom classification options:

<div class="row">
    <div class="col-6">
        <p><strong>Audio Quality Assessment:</strong></p>
        <label class="radio">
            <input type="radio" name="audioQuality" value="excellent" style="width: 20px;">
            Excellent
        </label>
        <label class="radio">
            <input type="radio" name="audioQuality" value="good" style="width: 20px;">
            Good
        </label>
        <label class="radio">
            <input type="radio" name="audioQuality" value="poor" style="width: 20px;">
            Poor
        </label>
    </div>
    <div class="col-6">
        <p><strong>Content Type:</strong></p>
        <label class="checkbox">
            <input type="checkbox" name="contentType" value="interview" style="width: 20px;">
            Interview
        </label>
        <label class="checkbox">
            <input type="checkbox" name="contentType" value="presentation" style="width: 20px;">
            Presentation
        </label>
    </div>
</div>

The classifications you add here will be included in your output manifest, allowing you to capture both segment-level and full-content annotations.

Extending Wavesurfer.js functionality

Our solution uses Wavesurfer.js, an open source audio visualization library. Although we’ve implemented core functionality for segmentation and annotation, you can extend this further using Wavesurfer.js’s rich feature set. For example, you might want to:

Add spectrogram visualization
Implement additional playback controls
Enhance zoom functionality
Add timeline markers

For these customizations, we recommend consulting the Wavesurfer.js documentation. When implementing additional Wavesurfer.js features, remember to test thoroughly in the SageMaker Ground Truth preview to verify compatibility with the labeling workflow.

Wavesurfer.js is distributed under the BSD-3-Clause license. Although we’ve tested the integration thoroughly, modifications you make to the Wavesurfer.js implementation should be tested in your environment. The Wavesurfer.js community provides excellent documentation and support for implementing additional features.

Clean up

To clean up the resources created during this tutorial, follow these steps:

Stop the SageMaker Ground Truth labeling job if it’s still running and you no longer need it. This will halt ongoing labeling tasks and stop additional charges from accruing.
Empty the S3 buckets by deleting all objects within them. S3 buckets must be emptied before they can be deleted, so removing all stored files facilitates a smooth cleanup process.
Delete the CloudFormation stack to remove all the AWS resources provisioned by the template. This action will automatically delete associated services like the S3 buckets, CloudFront distribution, Lambda function, and related IAM roles.
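
The same cleanup steps can be scripted. The following sketch takes boto3 clients supplied by the caller and runs the three steps in order. The function name and its parameters are hypothetical, and the bucket-emptying call shown here covers unversioned buckets only (a versioned bucket would also need its object versions deleted).

```python
def clean_up(sagemaker_client, s3_resource, cfn_client, job_name, bucket_names, stack_name):
    """Stop the labeling job, empty the buckets, then delete the stack."""
    status = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)[
        "LabelingJobStatus"
    ]
    # Only a job that is still active can (and needs to) be stopped.
    if status in ("Initializing", "InProgress"):
        sagemaker_client.stop_labeling_job(LabelingJobName=job_name)

    for name in bucket_names:
        # Deletes every current object in the bucket (unversioned buckets).
        s3_resource.Bucket(name).objects.all().delete()

    # Stack deletion is asynchronous; this call only starts it.
    cfn_client.delete_stack(StackName=stack_name)
```

You would pass boto3.client("sagemaker"), boto3.resource("s3"), and boto3.client("cloudformation") along with your own job, bucket, and stack names.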

Conclusion

In this post, we walked through implementing an audio and video segmentation solution using SageMaker Ground Truth. We saw how to deploy the necessary infrastructure, configure the annotation interface, and create labeling jobs both through the SageMaker console and programmatically. The solution’s ability to capture precise segment-level annotations along with overall content classifications makes it particularly valuable for generating high-quality training data for generative AI models, whether you’re working on speech synthesis, video generation, or other multimedia AI applications. As you develop your AI models for audio and video generation, remember that the quality of human feedback directly impacts your model’s performance—whether you’re training models to generate more natural-sounding speech, create coherent video sequences, or understand complex audio patterns.

We encourage you to visit our GitHub repository to explore the solution further and adapt it to your specific needs. You can enhance your annotation workflows by customizing the interface, adding new classification categories, or implementing additional Wavesurfer.js features. To learn more about creating custom labeling workflows in SageMaker Ground Truth, visit Accelerate custom labeling workflows in Amazon SageMaker Ground Truth without using AWS Lambda and Custom labeling workflows.

If you’re looking for a turnkey data labeling solution, consider Amazon SageMaker Ground Truth Plus, which provides access to an expert workforce trained in various machine learning tasks. With SageMaker Ground Truth Plus, you can quickly receive high-quality annotations without the need to build and manage your own labeling workflows, reducing costs by up to 40% and accelerating the delivery of labeled data at scale.

Start building your annotation workflow today and contribute to the next generation of AI models that push the boundaries of what’s possible in audio and video generation.

About the Authors

Sundar Raghavan is an AI/ML Specialist Solutions Architect at AWS, helping customers leverage SageMaker and Bedrock to build scalable and cost-efficient pipelines for computer vision applications, natural language processing, and generative AI. In his free time, Sundar loves exploring new places, sampling local eateries and embracing the great outdoors.

Vineet Agarwal is a Senior Manager of Customer Delivery on the Amazon Bedrock team, responsible for Human in the Loop services. He has been at AWS for over 2 years, managing go-to-market activities and business and technical operations. Prior to AWS, he worked in services leadership roles in the SaaS, fintech, and telecommunications industries. He has an MBA from the Indian School of Business and a B.Tech in Electronics and Communications Engineering from the National Institute of Technology, Calicut (India). In his free time, Vineet loves playing racquetball and enjoying outdoor activities with his family.

Author: Sundar Raghavan