Accelerate ML feature pipelines with new capabilities in Amazon SageMaker Feature Store


Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. It now supports Apache Iceberg table format, streaming ingestion, scalable batch ingestion, and fine-grained access control through AWS Lake Formation.

As organizations scale their machine learning platforms from experimentation to production, two operational challenges consistently surface. The first is securing access to sensitive feature data without introducing manual overhead for every new feature group. The second is keeping storage costs predictable when high-frequency streaming workloads generate ever-growing volumes of Apache Iceberg metadata. For example, one retail analytics team discovered that their Apache Iceberg-based offline store had accumulated over 50 TB of metadata files in under a year, driving substantial and unexpected Amazon Simple Storage Service (Amazon S3) charges. Meanwhile, infrastructure teams across industries told us they need Lake Formation-enforced access control on feature data that works automatically at the point of feature group creation, not as an afterthought that requires repetitive manual configuration.

Today, we’re announcing three new capabilities available in SageMaker Python SDK v3.8.0 that address these challenges:

Native AWS Lake Formation integration – Register your offline store with Lake Formation during feature group creation, or for existing feature groups, to enforce column-level, row-level, and cell-level access control. No manual Lake Formation setup required.
Additional Apache Iceberg table properties – Control metadata retention and snapshot lifecycle policies at feature group creation or on existing feature groups to prevent metadata accumulation and reduce storage costs.
Feature Store support in SageMaker Python SDK v3 – The modernized SDK v3.8.0 brings the full set of Feature Store capabilities, including these new features, into a modular, faster, lighter-weight package.

In this post, we walk through each capability with code examples you can use to get started. For complete end-to-end walkthroughs, see the accompanying notebooks for Lake Formation governance and Iceberg table properties in the SageMaker Python SDK repository.

Prerequisites

To follow along with the examples in this post, you need:

An AWS account with permissions to create Amazon SageMaker AI resources.
An Amazon SageMaker AI execution role with access to Amazon S3, AWS Glue, and AWS Lake Formation.
SageMaker Python SDK v3.8.0 or later. To install or upgrade the SDK, run: pip install --upgrade "sagemaker>=3.8.0"
For Lake Formation integration: at least one Data Lake Administrator configured in your account. Feature Store validates this before activating access control.
An existing Amazon S3 bucket for offline store data.

Solution overview

These capabilities are delivered through new parameters in the SDK v3 FeatureGroupManager.create() and FeatureGroupManager.update() calls. LakeFormationConfig triggers automatic access control setup, and IcebergProperties configures the metadata lifecycle. Both can be set at feature group creation time or applied to existing feature groups.

Feature Store in SageMaker Python SDK v3

SageMaker Python SDK v3.8.0, released April 16, 2026, is the foundation for the capabilities described in this post. The modernized SDK introduces a modular architecture, improved performance, and removal of legacy hard dependencies (such as PyTorch). These changes result in faster installation and smaller environments.

The following Feature Store capabilities are available in SDK v3:

Feature group lifecycle management: Create, describe, update, delete, and list feature groups.
Record operations: PutRecord, GetRecord, and BatchGetRecord.
Training dataset extraction: Point-in-time–correct queries for building training datasets.
DataFrame ingestion: FeatureGroupManager.ingest() from both Pandas and Spark DataFrames.
New offline store parameters: IcebergProperties and LakeFormationConfig are fully supported in the create and update workflows.
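
To make "point-in-time-correct" concrete: for each training label, the join may only use the most recent feature value whose event time is at or before the label's timestamp, so that no future information leaks into training. The following is a minimal sketch of that rule in plain Python, using hypothetical data and a hypothetical helper name. It is not how Feature Store runs the query against the offline store; it only illustrates the property the result should satisfy.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, label_time):
    """Return the latest feature value with event_time <= label_time, or None.

    feature_history: list of (event_time, value) tuples sorted by event_time.
    """
    times = [event_time for event_time, _ in feature_history]
    # bisect_right counts how many records were written at or before label_time.
    idx = bisect_right(times, label_time)
    return feature_history[idx - 1][1] if idx else None


# Hypothetical history of one feature for a single record identifier.
credit_score_history = [(100, 640), (200, 655), (300, 700)]

# A label observed at t=250 must see the value written at t=200, not t=300.
value_at_250 = point_in_time_lookup(credit_score_history, 250)
```

A label that predates every feature record gets no value at all, which is the behavior you want: filling it with a later value would leak the future into the training set.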

The Feature Store API surface is consistent with SDK v2, so existing code works with minimal changes. Review the SDK v3 changelog for details on breaking changes in other areas of the SDK.

Quick start with SDK v3

Here’s how to create a feature group with the new Lake Formation and Iceberg parameters:

fg = FeatureGroupManager.create(
    feature_group_name="my-features",
    record_identifier_feature_name="user_id",
    event_time_feature_name="event_time",
    feature_definitions=df,
    role_arn=role,
    online_store_config={"EnableOnlineStore": True},
    offline_store_config=OfflineStoreConfig(
        s3_storage_config=S3StorageConfig(s3_uri=f"s3://{bucket}/feature-store/"),
        table_format="Iceberg",
    ),
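    lake_formation_config=LakeFormationConfig(
        # Register the offline store with Lake Formation at creation time.
        enabled=True,
        # Keep IAM policies and Lake Formation permissions side by side.
        hybrid_access_mode_enabled=True,
        acknowledge_risk=True,
    ),
    iceberg_properties=IcebergProperties(
        properties={
            # Delete the oldest tracked metadata files after each commit.
            "write.metadata.delete-after-commit.enabled": "true",
            # Track at most 10 previous metadata files (default: 100).
            "write.metadata.previous-versions-max": "10",
        }
    ),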
)

Govern your offline store with native Lake Formation integration

Configuring AWS Lake Formation on Feature Store data previously required several manual steps: registering S3 locations, revoking the IAMAllowedPrincipals group, and configuring data filters for each feature group. This process was time-consuming, error-prone, and had to be repeated for every new feature group. Organizations in financial services, healthcare, and other regulated industries that need column-level, row-level, and cell-level access control found this particularly burdensome.

You can now activate Lake Formation access control on a feature group’s offline store at creation time by passing a LakeFormationConfig to FeatureGroupManager.create(). You can also activate it on existing feature groups using FeatureGroupManager.enable_lake_formation(). When this configuration is turned on, Feature Store automatically performs the following operations on your behalf:

Adds the S3 data location to Lake Formation. The offline store S3 prefix is registered as a Lake Formation–governed data lake location. Trusted analytics services (Amazon Athena, AWS Glue, Amazon EMR, Amazon Redshift Spectrum) then receive temporary credentials from Lake Formation to query the data.
Disables hybrid access mode (optional). When you set hybrid_access_mode_enabled=False, the SDK revokes the IAMAllowedPrincipals grant on the AWS Glue table, so access must go through Lake Formation’s permission model only. With hybrid_access_mode_enabled=True, both AWS Identity and Access Management (IAM) policies and Lake Formation permissions coexist, which is useful for gradual migration. For more information, see hybrid access mode.
Provides a recommended S3 deny policy. For customers who need end-to-end governance, the SDK logs a recommended bucket policy as a warning message after activation. Review this policy and apply it to your Amazon S3 bucket to block direct S3 reads for unauthorized principals, closing the last path that could bypass Lake Formation.
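
To give a sense of what such a deny policy can look like, here is an illustrative sketch that builds a bucket policy document denying direct object reads under the offline store prefix to every principal outside an allow list. This is not the policy the SDK logs; always prefer the policy from the SDK warning. The helper name, the Sid, the bucket, the prefix, and the role ARNs are all hypothetical placeholders.

```python
import json


def build_offline_store_deny_policy(bucket, prefix, allowed_principal_arns):
    """Deny s3:GetObject under the prefix to principals not in the allow list."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDirectReadsOutsideLakeFormation",  # hypothetical Sid
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
                "Condition": {
                    # Principals matching these ARNs are exempt from the deny.
                    "ArnNotLike": {"aws:PrincipalArn": list(allowed_principal_arns)}
                },
            }
        ],
    }


# Placeholder values; replace with your own bucket, prefix, and roles.
policy = build_offline_store_deny_policy(
    bucket="amzn-s3-demo-bucket",
    prefix="feature-store/",
    allowed_principal_arns=[
        "arn:aws:iam::111122223333:role/MyFeatureStoreExecutionRole",
        "arn:aws:iam::111122223333:role/aws-service-role/"
        "lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess",
    ],
)
policy_json = json.dumps(policy, indent=2)
```

The allow list matters: an explicit deny overrides any allow, so leaving out the role that Feature Store uses to write or the role Lake Formation uses to read the registered location would block legitimate access as well.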

This is an opt-in, per-feature-group setting. If you omit it, behavior is unchanged and existing feature groups continue to work with IAM-based access.

Code example

The following creates a new feature group with Lake Formation access control activated. For additional configuration options, see Enable Lake Formation with Feature Groups.

fg = FeatureGroupManager.create(
    feature_group_name="governed-customer-features",
    record_identifier_feature_name="customer_id",
    event_time_feature_name="event_time",
    feature_definitions=customer_df,
    role_arn=role,
    online_store_config={"EnableOnlineStore": True},
    offline_store_config=OfflineStoreConfig(
        s3_storage_config=S3StorageConfig(s3_uri=f"s3://{bucket}/feature-store/"),
        table_format="Iceberg",
    ),
    lake_formation_config=LakeFormationConfig(
        enabled=True,
        hybrid_access_mode_enabled=True,
        acknowledge_risk=True,
    ),
)

To activate Lake Formation on an existing feature group:

fg = FeatureGroupManager.get(
    feature_group_name="existing-feature-group",
)
fg.enable_lake_formation(
    hybrid_access_mode_enabled=True,
    acknowledge_risk=True,
)

After the feature group is configured, use the Lake Formation console or API to grant fine-grained permissions. You can grant a data science team SELECT access to only the customer_id, credit_score, and region columns (column-level filtering). You can also restrict an analyst to rows where region = 'us-east-1' (row-level filtering), or combine both for cell-level access control.
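
As a sketch of the API route, the following builds the request payloads for a column-level grant and a row-level data cells filter, then passes them to the boto3 Lake Formation client calls grant_permissions and create_data_cells_filter. The helper names, principal ARN, account ID, filter name, and table name are hypothetical, and the database name assumes the default sagemaker_featurestore Glue database; check the actual database and table names on your feature group before using anything like this.

```python
def column_grant_request(principal_arn, database, table, columns):
    """Payload for grant_permissions: SELECT on specific columns only."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def row_filter_request(account_id, database, table, filter_name, expression, columns):
    """Payload for create_data_cells_filter: a row filter plus a column list."""
    return {
        "TableData": {
            "TableCatalogId": account_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            "RowFilter": {"FilterExpression": expression},
            "ColumnNames": list(columns),
        }
    }


def apply(lakeformation_client, grant, cell_filter):
    """Send both requests; pass in boto3.client("lakeformation")."""
    lakeformation_client.grant_permissions(**grant)
    lakeformation_client.create_data_cells_filter(**cell_filter)


# Hypothetical names throughout.
TABLE = "governed_customer_features_1234567890"
grant = column_grant_request(
    "arn:aws:iam::111122223333:role/DataScienceTeam",
    "sagemaker_featurestore",
    TABLE,
    ["customer_id", "credit_score", "region"],
)
cell_filter = row_filter_request(
    "111122223333",
    "sagemaker_featurestore",
    TABLE,
    "us_east_1_rows",
    "region = 'us-east-1'",
    ["customer_id", "credit_score", "region"],
)
```

Creating a data cells filter does not by itself give anyone access. The analyst still needs a grant_permissions call whose Resource is the DataCellsFilter, which is omitted here for brevity.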

Key considerations

Online store isn’t affected. Lake Formation access control applies only to the offline store. The online store continues to use IAM-based authorization, so real-time inference latency is unchanged.

Works with both AWS Glue and Iceberg table formats. Lake Formation access control applies the same way regardless of which table format you use for the offline store.

Cross-account compatible. If you use AWS Resource Access Manager (AWS RAM) to share Feature Store tables across accounts, Lake Formation grants continue to work alongside existing cross-account sharing patterns. Note: you must disable hybrid access mode for cross-account access when the table format is Iceberg.

Prerequisite: Data Lake Administrator. The system validates that at least one Data Lake Administrator is configured in your account before activating access control. If none exists, the create call returns an immediate, descriptive error rather than failing asynchronously.
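
If you want to check this prerequisite yourself before calling create, one option is to read the account's data lake settings with the boto3 Lake Formation client. This is a minimal sketch with a hypothetical helper name; it is not the validation Feature Store performs internally, only a comparable pre-flight check.

```python
def has_data_lake_admin(lakeformation_client):
    """Return True if at least one Data Lake Administrator is configured.

    Pass in boto3.client("lakeformation"); the caller needs
    lakeformation:GetDataLakeSettings permission.
    """
    settings = lakeformation_client.get_data_lake_settings()
    admins = settings.get("DataLakeSettings", {}).get("DataLakeAdmins", [])
    return len(admins) > 0
```

Running a check like this in your provisioning script lets you surface a missing administrator before any feature group creation is attempted.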

For more information, see Enable Lake Formation with Feature Groups.

Manage your offline store with additional Iceberg table properties

Amazon SageMaker Feature Store supports Apache Iceberg as a table format for the offline store, which improves query performance through compaction and supports record-level operations. This section introduces new parameters that give you control over Iceberg metadata lifecycle.

For workloads with high-frequency writes (such as streaming feature pipelines that ingest records every few seconds), Iceberg metadata files accumulate with every commit. Without lifecycle controls, this metadata keeps growing for as long as the pipeline runs. One customer with over 40 streaming feature groups saw their S3 bucket grow from a few gigabytes to over 50 TB of metadata in under a year. Feature Store was committing to the offline store at high frequency (under 10 minutes between commits), and each commit produced new metadata files. Because no write properties were set to limit snapshots or metadata file retention, the metadata accumulated unchecked. The cleanup operations they attempted through Amazon Athena (OPTIMIZE and VACUUM) timed out on tables exceeding 50 TB. They had to resort to costly Amazon EMR Serverless Spark jobs and eventually rewrite their tables entirely.
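
A back-of-the-envelope calculation shows why the file count gets out of hand. The sketch below uses a deliberately simplified model: it counts only table metadata files, assumes one new file per commit, and ignores manifests, manifest lists, and file sizes. The 10-minute interval and the 40 feature groups come from the example above; the helper names are hypothetical.

```python
def commits_per_year(minutes_between_commits):
    """Number of commits in a 365-day year at a fixed commit interval."""
    return (365 * 24 * 60) // minutes_between_commits


def tracked_metadata_files(commits, delete_after_commit=False, previous_versions_max=100):
    """Metadata files left on storage after the given number of commits.

    Without cleanup every commit leaves one more file behind. With
    write.metadata.delete-after-commit.enabled, only the current file plus
    previous_versions_max older versions are kept.
    """
    if not delete_after_commit:
        return commits
    return min(commits, previous_versions_max + 1)


commits = commits_per_year(10)
feature_groups = 40

unbounded = feature_groups * tracked_metadata_files(commits)
bounded = feature_groups * tracked_metadata_files(
    commits, delete_after_commit=True, previous_versions_max=10
)
```

Under these assumptions, 40 feature groups committing every 10 minutes leave about 2.1 million metadata files after a year with no cleanup, compared with 440 when retention is capped at 10 previous versions. Real tables also accumulate snapshots and manifests, so actual storage depends on more than this count.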

The solution

You can now pass an IcebergProperties configuration when creating an Iceberg-format feature group. These properties are applied to the underlying Iceberg table, giving you control over metadata lifecycle from day one. You can also update Iceberg properties on existing feature groups using FeatureGroupManager.update().

Some examples of supported properties are:

Property	Default	Description
`write.metadata.delete-after-commit.enabled`	`false`	Delete oldest tracked metadata files after each commit
`write.metadata.previous-versions-max`	100	Max number of previous version metadata files to track
`history.expire.max-snapshot-age-ms`	`432000000` (5 days)	Max age of snapshots to keep while expiring
`history.expire.min-snapshots-to-keep`	1	Min number of snapshots to keep while expiring
`write.target-file-size-bytes`	`536870912` (512 MB)	Target size for generated data files
`write.parquet.row-group-size-bytes`	`134217728` (128 MB)	Parquet row group size
`read.split.target-size`	`134217728` (128 MB)	Target size when combining data input splits

For the complete list of supported properties, see Iceberg metadata management in the SageMaker AI documentation.
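
Because these properties take raw milliseconds and bytes, and the examples in this post pass every value as a string, it is easy to mistype a value. A small sketch of two hypothetical conversion helpers that let you write the values in days and MiB instead:

```python
def days_to_ms(days):
    """Convert days to a millisecond string, as Iceberg properties expect."""
    return str(int(days * 24 * 60 * 60 * 1000))


def mib_to_bytes(mib):
    """Convert mebibytes to a byte-count string."""
    return str(int(mib * 1024 * 1024))


# Property names are taken from the table above; the values are examples.
properties = {
    "history.expire.max-snapshot-age-ms": days_to_ms(1),
    "write.target-file-size-bytes": mib_to_bytes(512),
    "write.parquet.row-group-size-bytes": mib_to_bytes(128),
}
```

The resulting dictionary has the same shape as the properties argument shown in the examples in this post.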

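Code example

The following creates an Iceberg-format feature group with metadata and snapshot lifecycle properties set at creation time.
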
fg = FeatureGroupManager.create(
    feature_group_name="streaming-click-features",
    record_identifier_feature_name="session_id",
    event_time_feature_name="event_time",
    feature_definitions=clicks_df,
    role_arn=role,
    offline_store_config=OfflineStoreConfig(
        s3_storage_config=S3StorageConfig(s3_uri=f"s3://{bucket}/feature-store/"),
        table_format="Iceberg",
    ),
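    iceberg_properties=IcebergProperties(
        properties={
            # Delete the oldest tracked metadata files after each commit.
            "write.metadata.delete-after-commit.enabled": "true",
            # Track at most 10 previous metadata files (default: 100).
            "write.metadata.previous-versions-max": "10",
            # Expire snapshots older than 1 day (86,400,000 ms).
            "history.expire.max-snapshot-age-ms": "86400000",
            # Always keep at least 5 snapshots when expiring.
            "history.expire.min-snapshots-to-keep": "5",
            # Target 512 MB data files (same as the default).
            "write.target-file-size-bytes": "536870912",
        }
    ),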
)

To update Iceberg properties on an existing feature group:

fg = FeatureGroupManager.get(
    feature_group_name="existing-feature-group",
    include_iceberg_properties=True,
)
fg.update(
    iceberg_properties=IcebergProperties(
        properties={
            "write.metadata.delete-after-commit.enabled": "true",
            "write.metadata.previous-versions-max": "10",
        }
    )
)

Best practices

Start with metadata cleanup for streaming workloads. If your pipeline writes to the offline store more than once per minute, set write.metadata.delete-after-commit.enabled to "true" and limit write.metadata.previous-versions-max. This is the single most impactful configuration change for preventing storage cost overruns.

Continue running compaction. These properties manage metadata lifecycle, but you still need to run Iceberg compaction (using Athena OPTIMIZE + VACUUM or Spark maintenance actions) to merge small data files for optimal query performance.

Tune snapshot retention for compliance needs. Audit-heavy workloads that require time-travel queries should use higher values for history.expire.min-snapshots-to-keep and history.expire.max-snapshot-age-ms. Cost-optimized streaming pipelines benefit from shorter retention.

Set properties at creation time. These properties take effect on new commits. For existing feature groups with accumulated metadata, use FeatureGroupManager.update() to set properties, then run Spark snapshot expiration and orphan file deletion to reclaim storage.
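
For the reclaim step on an existing table, Apache Iceberg ships Spark procedures named expire_snapshots and remove_orphan_files. The sketch below only builds the two CALL statements and hands them to a Spark session. It assumes a session already configured with the Iceberg runtime and a catalog pointing at AWS Glue; the catalog name glue_catalog, the table name, the cutoff timestamp, and the helper names are hypothetical.

```python
def maintenance_statements(catalog, table, older_than, retain_last):
    """Build Iceberg Spark procedure calls for snapshot and orphan file cleanup.

    table: "database.table"; older_than: "YYYY-MM-DD HH:MM:SS" cutoff.
    """
    return [
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}', "
        f"retain_last => {retain_last})",
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}')",
    ]


def run_maintenance(spark, statements):
    """Execute each statement on an existing, Iceberg-enabled SparkSession."""
    for sql in statements:
        spark.sql(sql).show(truncate=False)


# Hypothetical catalog, table, and cutoff.
statements = maintenance_statements(
    catalog="glue_catalog",
    table="sagemaker_featurestore.existing_feature_group_1234567890",
    older_than="2026-01-01 00:00:00",
    retain_last=5,
)
```

Expire snapshots first, then remove orphan files. Choose a cutoff comfortably older than any write still in progress, because orphan file removal deletes files that no table metadata references.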

For the complete list of supported properties, see Iceberg metadata management.
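
For the compaction practice described above, here is a minimal sketch that runs Athena's OPTIMIZE and VACUUM statements in sequence through boto3, waiting for each to finish before starting the next. The helper names, database, table name, and output location are hypothetical. The sketch assumes an unquoted table name containing only letters, digits, and underscores, and it leaves out retries and timeouts for brevity.

```python
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def compaction_statements(table):
    """Athena statements that compact data files, then clean up old files."""
    return [
        f"OPTIMIZE {table} REWRITE DATA USING BIN_PACK",
        f"VACUUM {table}",
    ]


def run_and_wait(athena_client, sql, database, output_location, poll_seconds=5):
    """Start one query and poll until it reaches a terminal state."""
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


def compact(athena_client, database, table, output_location, poll_seconds=5):
    """Run OPTIMIZE then VACUUM; pass in boto3.client("athena")."""
    for sql in compaction_statements(table):
        state = run_and_wait(athena_client, sql, database, output_location, poll_seconds)
        if state != "SUCCEEDED":
            raise RuntimeError(f"{sql!r} finished in state {state}")
```

Scheduling something like this regularly keeps each run small, which matters because these statements can time out on very large tables.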

Putting it together

By combining both capabilities in a single FeatureGroupManager.create() call, you produce a feature group that’s simultaneously governed and cost-optimized. No follow-up configuration is required. The offline store metadata is automatically managed, and Lake Formation access control is active without manual registration. The online store continues to serve low-latency features with IAM authorization.

fg = FeatureGroupManager.create(
    feature_group_name="real-time-user-signals",
    record_identifier_feature_name="user_id",
    event_time_feature_name="event_time",
    feature_definitions=signals_df,
    role_arn=role,
    online_store_config={"EnableOnlineStore": True},
    offline_store_config=OfflineStoreConfig(
        s3_storage_config=S3StorageConfig(s3_uri=f"s3://{bucket}/feature-store/"),
        table_format="Iceberg",
    ),
    lake_formation_config=LakeFormationConfig(
        enabled=True,
        hybrid_access_mode_enabled=True,
        acknowledge_risk=True,
    ),
    iceberg_properties=IcebergProperties(
        properties={
            "write.metadata.delete-after-commit.enabled": "true",
            "write.metadata.previous-versions-max": "10",
            "history.expire.max-snapshot-age-ms": "86400000",
            "history.expire.min-snapshots-to-keep": "5",
        }
    ),
)

For complete end-to-end notebooks with step-by-step instructions, see the Lake Formation governance notebook and the Iceberg table properties notebook in the SageMaker Python SDK repository.

Cleanup

To avoid ongoing charges, delete the feature groups that you created while following this walkthrough. If you added Amazon S3 locations to Lake Formation, deregister them through the Lake Formation console or the DeregisterResource API. Revoke the Lake Formation permissions you granted for testing.
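
These cleanup steps can also be scripted. The sketch below uses the boto3 SageMaker and Lake Formation clients, calling delete_feature_group and deregister_resource. The helper name, feature group names, and S3 location ARN are placeholders for whatever you created.

```python
def cleanup(sagemaker_client, lakeformation_client, feature_group_names, s3_location_arns):
    """Delete feature groups and deregister their Lake Formation locations.

    Pass in boto3.client("sagemaker") and boto3.client("lakeformation").
    """
    for name in feature_group_names:
        sagemaker_client.delete_feature_group(FeatureGroupName=name)
    for arn in s3_location_arns:
        lakeformation_client.deregister_resource(ResourceArn=arn)


# Placeholder names matching the examples in this post.
feature_groups_to_delete = ["governed-customer-features", "streaming-click-features"]
locations_to_deregister = ["arn:aws:s3:::amzn-s3-demo-bucket/feature-store"]
```

Deleting a feature group does not delete the data already written to the offline store in Amazon S3 or the associated AWS Glue table, so remove those separately if you no longer need them.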

Conclusion

Together, these enhancements make Amazon SageMaker Feature Store simpler to secure, more cost-efficient to operate, and faster to integrate into your ML pipelines. By automating Lake Formation access control, surfacing fine-grained Iceberg lifecycle settings, and delivering both through a lightweight modular SDK, these changes remove the undifferentiated heavy lifting that previously stood between your team and production-ready feature management at scale. Whether you are onboarding your first feature group or managing hundreds across multiple teams, these capabilities help you move faster, with access control and cost controls built in from day one. We encourage you to upgrade to SageMaker Python SDK v3.8.0 and explore how these capabilities can streamline your existing workflows.

For more information, see the Feature Store documentation, the Lake Formation access control guide, the Iceberg metadata management guide, and the SDK v3 release notes. To get hands-on, try the Lake Formation notebook and the Iceberg properties notebook.


About the authors

Author: Dhaval Shah