Introducing the New Amazon Data Firehose Capability: Streaming Database Updates to Apache Iceberg on Amazon S3
Amazon Web Services (AWS) has rolled out a preview of a new feature in Amazon Data Firehose that captures changes made to databases such as PostgreSQL and MySQL and replicates those updates into Apache Iceberg tables stored on Amazon Simple Storage Service (Amazon S3). In this post, we’ll explore this new capability, what it means for users, and how it can be used to support data analytics and machine learning (ML) applications.
Understanding the Components
Apache Iceberg: An open source, high-performance table format for big data analytics. Apache Iceberg brings the reliability and simplicity of SQL tables to large datasets and makes it possible for analytics engines such as Apache Spark, Apache Flink, Trino, Apache Hive, and Apache Impala to safely work with the same tables concurrently.
Amazon Simple Storage Service (Amazon S3): An object storage service from AWS that offers industry-leading scalability, data availability, security, and performance. Businesses of all sizes can use Amazon S3 to store and protect any amount of data for use cases such as websites, mobile applications, backup and restore, archiving, enterprise applications, IoT devices, and big data analytics.
What Does the New Capability Offer?
This new feature in Amazon Data Firehose simplifies the process of streaming database updates, ensuring that there’s no adverse effect on the transaction performance of your database applications. In just a few minutes, users can set up a Data Firehose stream to deliver change data capture (CDC) updates from databases. This functionality allows for seamless replication of data from various databases into Iceberg tables on Amazon S3, which can then be used for real-time analytics and ML applications.
Advantages for AWS Enterprise Customers
AWS enterprise clients typically operate hundreds of databases for their transactional applications. These businesses often need to perform large-scale analytics and ML on the most recent datasets. Capturing changes in databases—such as when records are inserted, modified, or deleted—and transmitting those updates to a data warehouse or data lake in formats like Apache Iceberg is crucial for these operations.
Traditional methods involve developing extract, transform, and load (ETL) jobs to read data from databases periodically. However, these ETL processes can hinder database transaction performance and introduce delays before data is available for analysis. By streaming database changes—referred to as CDC streams—organizations can mitigate these issues.
Streamlining with Amazon Data Firehose
Previously, setting up systems for CDC streams required significant manual effort, including the installation and configuration of multiple open-source components like Debezium and Apache Kafka Connect clusters. This setup process could take days or even weeks, and ongoing maintenance added to operational overhead.
The new feature in Amazon Data Firehose simplifies this by enabling the acquisition and continuous replication of CDC streams from databases to Apache Iceberg tables on Amazon S3. Users only need to specify the source and destination to set up a Data Firehose stream. The service captures an initial data snapshot and continuously replicates any subsequent changes in the selected database tables. This reduces the impact on database transactions and eliminates the need for capacity provisioning or cluster management.
Additionally, Data Firehose can automatically create Apache Iceberg tables using the same schema as the database tables, while also adapting the target schema to accommodate changes like new column additions.
Fully Managed Service Benefits
Being a fully managed service, Amazon Data Firehose eliminates the reliance on open-source components, the need for software updates, and the associated operational overhead. This makes it a simple, scalable, end-to-end solution for delivering CDC streams into data lakes or warehouses, enabling large-scale analytics and ML applications.
Setting Up a CDC Pipeline
Setting up a CDC pipeline with Amazon Data Firehose can be done via the AWS Management Console, AWS CLI, AWS SDKs, AWS CloudFormation, or Terraform. For demonstration purposes, let’s consider using a MySQL database on Amazon RDS as the source; a short Python (boto3) sketch of the first two steps follows the list below.
- Create a VPC Service Endpoint: To establish a secure connection between your virtual private cloud (VPC) and the RDS API without sending traffic over the internet, create an interface VPC endpoint powered by AWS PrivateLink.
- Configure an S3 Bucket: Set up an S3 bucket to host the Iceberg table. Ensure you have an AWS Identity and Access Management (IAM) role with appropriate permissions.
- Create a Firehose Stream: In the Amazon Data Firehose section of the AWS Management Console, create a new Firehose stream by selecting the source (e.g., MySQL database) and destination (e.g., Apache Iceberg Tables). Provide the necessary details like the Firehose stream name, database endpoint, and database VPC endpoint service name.
- Specify Data to Capture: Configure Data Firehose to capture specific data by designating databases, tables, and columns using explicit names or regular expressions.
- Set Up Watermark Table: Create a watermark table, which helps track the progress of incremental snapshots of database tables.
- Configure S3 Bucket Region and Name: Specify the Region and name of the S3 bucket to use. Data Firehose can automatically create Iceberg tables if they don’t exist and update table schemas when detecting changes in your database schema.
- Enable CloudWatch Logging: To monitor the stream’s progress and errors, enable Amazon CloudWatch error logging. You can configure a short retention period to reduce storage costs.
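As a rough sketch of the first two steps, the Python (boto3) snippet below creates the interface VPC endpoint for the RDS API, the destination S3 bucket, and the IAM role that Data Firehose assumes. All resource identifiers and names are placeholders, and the inline policy is intentionally minimal; the complete set of required permissions (including AWS Glue Data Catalog and CloudWatch Logs actions) is listed in the Data Firehose documentation. Steps 3 through 7 are then carried out when you create the stream itself, for example in the console.

```python
# Sketch of steps 1-2: prerequisite resources for the CDC pipeline.
# All identifiers (VPC, subnets, security groups, bucket, role names) are placeholders.
import json
import boto3

region = "us-east-1"
ec2 = boto3.client("ec2", region_name=region)
s3 = boto3.client("s3", region_name=region)
iam = boto3.client("iam")

# Step 1: interface VPC endpoint (AWS PrivateLink) for the RDS API, so Data Firehose
# can reach the database without sending traffic over the internet.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",               # placeholder
    ServiceName=f"com.amazonaws.{region}.rds",   # RDS API endpoint service
    SubnetIds=["subnet-0123456789abcdef0"],      # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder
    PrivateDnsEnabled=True,
)

# Step 2: S3 bucket that will hold the Apache Iceberg table data.
s3.create_bucket(Bucket="my-iceberg-cdc-bucket")  # in us-east-1, no LocationConstraint needed

# IAM role that Data Firehose assumes when writing to the bucket.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "firehose.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName="firehose-iceberg-role",
                AssumeRolePolicyDocument=json.dumps(trust_policy))

# Minimal S3 permissions; see the Firehose documentation for the full policy.
iam.put_role_policy(
    RoleName="firehose-iceberg-role",
    PolicyName="firehose-iceberg-s3-access",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject",
                       "s3:ListBucket", "s3:GetBucketLocation",
                       "s3:AbortMultipartUpload", "s3:ListBucketMultipartUploads"],
            "Resource": ["arn:aws:s3:::my-iceberg-cdc-bucket",
                         "arn:aws:s3:::my-iceberg-cdc-bucket/*"],
        }],
    }),
)
```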
Once the setup is complete, the stream will begin replicating data, allowing you to monitor its status and check for errors.
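If you prefer to script that status check, a minimal boto3 sketch might look like the following. The stream name is a placeholder, and the log group path assumes the default naming Firehose uses for error logging.

```python
# Check the stream status and pull recent error log events.
import boto3

region = "us-east-1"
firehose = boto3.client("firehose", region_name=region)
logs = boto3.client("logs", region_name=region)

stream_name = "mysql-to-iceberg-cdc"  # placeholder stream name

# Overall stream status (for example CREATING or ACTIVE).
desc = firehose.describe_delivery_stream(DeliveryStreamName=stream_name)
print(desc["DeliveryStreamDescription"]["DeliveryStreamStatus"])

# Recent error log events, if CloudWatch error logging was enabled on the stream.
events = logs.filter_log_events(
    logGroupName=f"/aws/kinesisfirehose/{stream_name}",  # assumed default log group name
    limit=20,
)
for event in events.get("events", []):
    print(event["message"])
```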
Testing the Stream
To test the stream, you can insert a new record into a database table. Navigate to the S3 bucket configured as the destination to observe that a file has been created to store data from the table. Downloading and inspecting the Parquet files directly is useful for a quick check, but in practice you would use AWS Glue to manage your data catalog and Amazon Athena to run SQL queries against the Iceberg tables.
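As an illustration, such a test could be scripted as in the sketch below. The connection details, database, table, and bucket names are invented for the example, pymysql must be installed separately, and Athena is assumed to be pointed at the AWS Glue database that catalogs the Iceberg table.

```python
# Insert one row into the source MySQL table, then query the replicated Iceberg table
# through Amazon Athena. All names and credentials below are hypothetical placeholders.
import time
import boto3
import pymysql

# 1. Write a new record to the source database.
conn = pymysql.connect(host="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",
                       user="admin", password="***", database="salesdb")
with conn.cursor() as cur:
    cur.execute("INSERT INTO orders (order_id, amount) VALUES (%s, %s)", (1001, 42.50))
conn.commit()
conn.close()

# 2. Give the stream a moment to deliver the change, then query the Iceberg table.
time.sleep(60)
athena = boto3.client("athena", region_name="us-east-1")
query = athena.start_query_execution(
    QueryString="SELECT * FROM orders WHERE order_id = 1001",
    QueryExecutionContext={"Database": "salesdb"},  # Glue database holding the Iceberg table
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print("Athena query started:", query["QueryExecutionId"])
```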
Additional Information
- Support for Self-Managed Databases: This new capability supports self-managed PostgreSQL and MySQL databases on Amazon EC2, as well as specific databases on Amazon RDS.
- Future Database Support: The AWS team is working on supporting additional databases like SQL Server, Oracle, and MongoDB.
- AWS PrivateLink Integration: Data Firehose uses AWS PrivateLink to connect to databases within your Amazon VPC.
- Wildcard Usage: When setting up an Amazon Data Firehose delivery stream, you can name specific tables and columns or use wildcards. If new tables or columns that match a wildcard are added after the stream is created, Data Firehose automatically creates them in the destination (see the conceptual snippet after this list).
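The snippet below is a conceptual illustration of how a regular-expression pattern selects matching tables; the pattern and table names are made up, and the exact syntax Data Firehose accepts is documented in the console.

```python
# Conceptual only: a regex pattern selecting source tables by name.
import re

pattern = re.compile(r"salesdb\.orders_.*")  # match every table whose name starts with orders_
tables = ["salesdb.orders_2024", "salesdb.orders_2025", "salesdb.customers"]
selected = [t for t in tables if pattern.fullmatch(t)]
print(selected)  # ['salesdb.orders_2024', 'salesdb.orders_2025']
```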
Pricing and Availability
The new data streaming capability is available in all AWS Regions except China, AWS GovCloud (US), and Asia Pacific (Malaysia). During the preview period, usage is free, but future pricing will be based on usage, such as the quantity of bytes read and delivered. There are no commitments or upfront investments required.
For detailed pricing information, visit the AWS pricing page. To get started, configure your first continual database replication to Apache Iceberg tables on Amazon S3 by visiting Amazon Data Firehose.
For more information, refer to this article.