Introduction to Amazon SageMaker’s New Features
Amazon has recently unveiled the next generation of Amazon SageMaker, a unified platform for data, analytics, and artificial intelligence (AI). The upgraded platform brings widely used AWS machine learning and analytics capabilities together in one environment. At its center is SageMaker Unified Studio, currently in preview, which serves as a single workspace for data exploration, preparation, integration, big data processing, fast SQL analytics, model development, training, and generative AI application creation. One of the standout additions in this release is Amazon SageMaker Lakehouse, which unifies data across data lakes and data warehouses so that advanced analytics and AI/ML applications can be built on a single copy of the data.
Enhanced Data Management and Permissions
Alongside these new capabilities, Amazon has introduced data catalog and permissions features within SageMaker Lakehouse. They allow users to centrally connect to, discover, and manage access to various data sources, which streamlines the work of analysts and data scientists. Organizations often spread their data across different systems to serve specific use cases and scaling needs. This distribution can create data silos across lakes, warehouses, databases, and streaming services, which makes it hard for analysts and scientists to connect and analyze data from these disparate sources.
Traditionally, such tasks required setting up specialized connectors for each data source, managing multiple access policies, and frequently duplicating data, all of which increased costs and risked data inconsistencies. The new SageMaker Lakehouse capability addresses these challenges by simplifying the process of connecting to popular data sources, cataloging them, applying permissions, and making the data available for analysis through SageMaker Lakehouse and Amazon Athena. The integration uses AWS Glue Data Catalog as a centralized metadata repository for all data sources, providing a single view of the available data.
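To make the idea of a centralized metadata repository concrete, here is a minimal Python sketch that groups table names by database from AWS Glue Data Catalog metadata. It is an illustration under assumptions, not part of the announcement: the helper names are hypothetical, the fetch function reads only the account's default catalog, and reaching a federated catalog would need additional parameters that are not shown.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_tables_by_database(tables: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group Glue table entries (dicts with 'DatabaseName' and 'Name') by database."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for table in tables:
        grouped[table["DatabaseName"]].append(table["Name"])
    # Sort table names so the summary is stable across calls.
    return {database: sorted(names) for database, names in grouped.items()}


def fetch_all_tables(region_name: str) -> List[Dict[str, Any]]:
    """Collect table metadata from the default Glue Data Catalog (needs AWS credentials)."""
    import boto3  # imported lazily so the pure helper above works without AWS access

    glue = boto3.client("glue", region_name=region_name)
    tables: List[Dict[str, Any]] = []
    for db_page in glue.get_paginator("get_databases").paginate():
        for database in db_page["DatabaseList"]:
            table_pages = glue.get_paginator("get_tables").paginate(
                DatabaseName=database["Name"]
            )
            for table_page in table_pages:
                tables.extend(table_page["TableList"])
    return tables
```

With credentials configured, `group_tables_by_database(fetch_all_tables("us-east-1"))` would return a mapping from each database name to its table names.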
Streamlining Data Source Connections
Data source connections are established once and can be reused, eliminating the need for repeated setup. As connections to data sources are made, databases and tables are automatically cataloged and registered with AWS Lake Formation. Once cataloged, access to them can be granted centrally, so data analysts can reach the databases and tables they need without connecting to each source individually or handling the data source's secrets. Lake Formation permissions define fine-grained access control policies across data lakes, data warehouses, and online transaction processing (OLTP) data sources, and these policies are enforced consistently when the data is queried with Athena. Data remains in its original location, reducing the need for costly and time-consuming data transfers or duplication. Users can create new data source connections or reuse existing ones within the Data Catalog and configure built-in connectors to multiple data sources, including Amazon S3, Amazon Redshift, Amazon Aurora, Amazon DynamoDB (preview), Google BigQuery, and more.
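As an illustration of what a fine-grained Lake Formation permission can look like, the following Python sketch builds a column-level SELECT grant and passes it to the Lake Formation `grant_permissions` API through boto3. The role ARN, database, table, and column names used with it are hypothetical placeholders, and the article does not say that its demo environment was configured this way.

```python
from typing import Any, Dict, Sequence


def build_column_grant(
    principal_arn: str, database: str, table: str, columns: Sequence[str]
) -> Dict[str, Any]:
    """Build grant_permissions arguments that allow SELECT on the listed columns only."""
    if not columns:
        raise ValueError("at least one column is required for a column-level grant")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_grant(request: Dict[str, Any], region_name: str) -> None:
    """Send the grant to Lake Formation (needs AWS credentials and admin rights)."""
    import boto3  # imported lazily so the request builder works without AWS access

    boto3.client("lakeformation", region_name=region_name).grant_permissions(**request)
```

For example, `build_column_grant("arn:aws:iam::111122223333:role/analyst", "customer_db", "customers", ["customer_id", "zip_code"])` would describe a grant that leaves any other column, such as a phone number, unreadable for that role.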
Implementation with Athena and Lake Formation
To demonstrate the integration between Athena and Lake Formation, a preconfigured environment incorporating Amazon DynamoDB as a data source is used. This environment is equipped with relevant tables and data to effectively showcase the new capability, utilizing the SageMaker Unified Studio (preview) interface.
Setting Up Projects
Accessing the SageMaker Unified Studio (preview) through the Amazon SageMaker domain allows users to create and manage projects, which serve as collaborative workspaces. These projects enable team members to work together on data and develop machine learning models. Creating a project automatically sets up AWS Glue Data Catalog databases, establishes a catalog for Redshift Managed Storage (RMS) data, and provisions necessary permissions. Users can manage projects by either viewing a list of existing projects or creating a new one. For demonstration purposes, two existing projects are used: sales-group, where administrators have full access to all data, and marketing-project, where analysts have restricted data access permissions.
Adding Data Sources
In the demonstration, a federated catalog for the target data source, Amazon DynamoDB, is set up. Navigating to the Data section in the left pane and selecting the plus sign allows for the addition of new data connections. After selecting Amazon DynamoDB and proceeding through the setup, the federated catalog is created within SageMaker Lakehouse. Administrators can then grant access using resource policies, which have been preconfigured in this environment. The fine-grained access controls are then showcased within the SageMaker Unified Studio (preview).
Querying with Athena
Within the sales-group project, administrators have full access to customer data, including fields such as zip codes, customer IDs, and phone numbers. By selecting the Query with Athena option, queries can be executed to analyze this data. The integrated query environment provides a seamless workspace for data exploration and analysis. Switching to the marketing-project environment demonstrates the experience of an analyst, verifying that fine-grained access control permissions are in effect and restricting data access as intended. Example queries illustrate how analysts interact with data while adhering to established security controls.
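The article runs its queries from the Query with Athena option in the studio interface. For readers who prefer a programmatic route, the sketch below shows roughly how a similar query could be submitted with the boto3 Athena client. The SQL text, catalog, database, and S3 output location passed to it are assumptions for illustration and do not come from the demo environment.

```python
import time
from typing import Any, Dict, List, Optional


def build_athena_request(
    sql: str, database: str, output_location: str, catalog: str = "AwsDataCatalog"
) -> Dict[str, Any]:
    """Build start_query_execution arguments for a query against one catalog and database."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(
    request: Dict[str, Any], region_name: str, poll_seconds: float = 1.0
) -> List[List[Optional[str]]]:
    """Run the query, wait for it to finish, and return the first page of rows."""
    import boto3  # imported lazily so the request builder works without AWS access

    athena = boto3.client("athena", region_name=region_name)
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(status.get("StateChangeReason", status["State"]))
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # For SELECT statements the first returned row holds the column headers.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

Under column-level permissions, one would expect the same query to succeed for an administrator role and to fail or return fewer columns for an analyst role, which matches the behavior the two projects are meant to demonstrate.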
New Capabilities Now Available
These new data catalog and permissions capabilities streamline data operations, enhance security governance, and accelerate AI/ML development while maintaining data integrity and compliance across the entire data ecosystem. Amazon SageMaker Lakehouse’s data catalog and permissions features simplify interactive analytics through federated queries, providing a unified catalog and permissions system across multiple data sources. This allows for a single place to define and enforce fine-grained security policies across data lakes, data warehouses, and OLTP data sources, ensuring a high-performing query experience.
These new features are available in US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), and Asia Pacific (Tokyo) AWS Regions. To get started with this new capability, users can refer to the Amazon SageMaker Lakehouse documentation.