If you manage data in Amazon Aurora MySQL-Compatible Edition and want to make it available for analytics, machine learning (ML), or cross-service querying in a modern lakehouse format, you're not alone.
Organizations often need to run analytics, build ML models, or join data across multiple sources. These workloads can be resource-intensive and impractical to run directly against a transactional database. By extracting your Aurora MySQL data into Amazon S3 Tables in Apache Iceberg format, you can offload analytical queries from your production database without impacting its performance, while storing data in a fully managed Iceberg table store optimized for analytics. Built on the open Apache Iceberg standard, Amazon Simple Storage Service (Amazon S3) Tables data is queryable from engines such as Amazon Athena, Amazon Redshift Spectrum, and Apache Spark without additional data copies. You can also combine relational data with other datasets already in your data lake, enabling richer cross-domain insights.
Apache Iceberg and Amazon S3 Tables
Apache Iceberg is a widely adopted open table format that offers atomicity, consistency, isolation, durability (ACID) transactions, schema evolution, and time travel capabilities. It allows multiple engines to work concurrently on the same dataset, making it a popular choice for building open lakehouse architectures.
Amazon S3 Tables is a purpose-built, fully managed Apache Iceberg table store designed for analytics workloads. It delivers up to 3x faster query performance and up to 10x more transactions per second compared to self-managed Iceberg tables. It also automatically compacts data and removes unreferenced files to optimize storage and performance.
In this post, you learn how to set up an automated, end-to-end solution that extracts tables from Amazon Aurora MySQL Serverless v2 and writes them to Amazon S3 Tables in Apache Iceberg format using AWS Glue. The entire infrastructure is deployed with a single AWS CloudFormation stack.
Requirements
AWS offers zero-ETL integrations from Amazon Aurora to Amazon Redshift and Amazon SageMaker AI, enabling seamless data flow for analytics and machine learning workloads.
However, there isn't yet a native zero-ETL integration between Amazon Aurora and Amazon S3 Tables. This means that organizations looking to use Amazon S3 Tables for their lakehouse architecture currently face several requirements:
- Building ETL pipelines to extract data from Amazon Aurora and transform it into Apache Iceberg format
- Configuring networking and security for AWS Glue jobs to access Amazon Aurora databases in private subnets
- Coordinating the provisioning of source databases, ETL pipelines, and target table stores
- Managing the end-to-end workflow without native automation
Solution overview
In this solution, you automate the extraction of relational database tables from Amazon Aurora MySQL Serverless v2 to Amazon S3 Tables in Apache Iceberg format using AWS Glue 5.0. To help you get started and test this solution, a CloudFormation template is provided. The template provisions the required infrastructure, loads sample data, and configures the extract, transform, load (ETL) pipeline. You can adapt the template to your own scenario.

Sample data
This solution uses the TICKIT sample database, a well-known dataset used in the Amazon Redshift documentation. The TICKIT data models a fictional ticket sales system with seven interrelated tables: users, venue, category, date, event, listing, and sales. The dataset is publicly available, as noted in the Amazon Redshift Getting Started Guide.
Solution flow
The solution flow, as shown in the preceding architecture diagram:
- An AWS Lambda function downloads the TICKIT sample dataset (a fictional ticket sales system used in the Amazon Redshift documentation) from a public Amazon S3 bucket to a staging S3 bucket.
- A second Lambda function, using PyMySQL (a Python MySQL client library), loads the staged data files into the Aurora MySQL Serverless v2 database using LOAD DATA LOCAL INFILE.
- The AWS Glue job reads the seven TICKIT tables from Aurora MySQL through a native MySQL connection and writes them to Amazon S3 Tables in Apache Iceberg format using the S3 Tables REST catalog endpoint with SigV4 authentication.
- You can query the migrated data in S3 Tables using Amazon Athena.
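For context on the Glue step above, wiring Spark to the S3 Tables Iceberg REST catalog endpoint with SigV4 signing typically comes down to a handful of catalog properties along these lines. This is a sketch, not the exact job configuration: the Region, account ID, and bucket name are placeholders, and the catalog name `s3tablesbucket` is an arbitrary choice.

```properties
# Register an Iceberg REST catalog backed by the S3 Tables endpoint.
spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.s3tablesbucket.type=rest
spark.sql.catalog.s3tablesbucket.uri=https://s3tables.us-east-1.amazonaws.com/iceberg
spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/my-tickit-table-bucket
# Sign REST catalog requests with SigV4.
spark.sql.catalog.s3tablesbucket.rest.sigv4-enabled=true
spark.sql.catalog.s3tablesbucket.rest.signing-name=s3tables
spark.sql.catalog.s3tablesbucket.rest.signing-region=us-east-1
# Enable Iceberg SQL extensions (DDL such as USING ICEBERG).
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```

With a catalog configured this way, the Glue job can address targets as `s3tablesbucket.<namespace>.<table>` in Spark SQL.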
The solution consists of the following key components:
- Amazon Aurora MySQL Serverless v2 as the source relational database containing the TICKIT sample dataset (users, venue, category, date, event, listing, and sales tables)
- AWS Secrets Manager to store the Aurora MySQL database credentials securely
- An Amazon S3 staging bucket for the TICKIT sample data files downloaded from the public redshift-downloads S3 bucket
- AWS Lambda functions using PyMySQL to load data into Aurora MySQL
- An AWS Glue 5.0 job (PySpark) to read tables from Aurora MySQL and write them to S3 Tables in Apache Iceberg format
- Amazon S3 Tables as the target storage for the migrated Iceberg tables
- An Amazon VPC with private subnets and VPC endpoints for Amazon S3, S3 Tables, AWS Glue, Secrets Manager, AWS Security Token Service (AWS STS), CloudWatch Logs, and CloudFormation
Here are some advantages of this architecture:
- Fully automated setup: A single CloudFormation stack provisions the required infrastructure, loads sample data, and configures the ETL pipeline.
- Serverless and cost-efficient: Aurora MySQL Serverless v2 and AWS Glue both scale based on demand, minimizing idle costs.
- Apache Iceberg table format: Data is stored in Apache Iceberg format, enabling ACID transactions, schema evolution, and time travel queries.
- Network isolation and credential management: The resources run inside private subnets with virtual private cloud (VPC) endpoints, and database credentials are managed through AWS Secrets Manager.
- Extensible pattern: The same approach can be adapted for other relational databases (PostgreSQL, SQL Server) and other target formats supported by AWS Glue.
Prerequisites
To follow along, you need an AWS account. If you don't yet have an AWS account, you must create one. The CloudFormation stack deployment takes approximately 30-45 minutes to complete, and the walkthrough assumes familiarity with Amazon S3 Tables, AWS CloudFormation, Apache Iceberg, AWS Glue, and Amazon Aurora. This solution will incur AWS costs. The main cost drivers are AWS Glue ETL job runs (billed per DPU-hour, proportional to data volume) and Amazon S3 Tables storage and request charges. Remember to clean up resources when you are finished to avoid unnecessary charges.
CloudFormation parameters
You can configure the following parameters before deploying the CloudFormation stack:
| Parameter | Description | Default | Required |
|---|---|---|---|
| S3TableBucketName | Name of the S3 Tables bucket to create (or use existing) | (none) | Yes |
| DatabaseName | Name of the initial Aurora MySQL database | tickit | No |
| MasterUsername | Master username for Aurora MySQL | admin | No |
| VpcCidr | CIDR block for the VPC | 10.1.0.0/16 | No |
| S3TableNamespace | Namespace for S3 Tables | tickit | No |
Implementation walkthrough
The following steps walk you through deploying and testing the end-to-end solution from scratch. If you are already running some of these components, you can skip to the relevant step. You can also refer to the aws-samples repository, sample-to-write-aurora-mysql-to-s3tables-using-glue, for the entire solution.
Step 1: Deploy the CloudFormation stack
Deploy the CloudFormation template scripts/aurora-mysql-to-s3tables-stack.yaml using the AWS Management Console or the AWS Command Line Interface (AWS CLI). Provide a name for the S3 Tables bucket; the stack will create it automatically (or use an existing one if it already exists).
To deploy using the console (recommended), navigate to the AWS CloudFormation console and use the CloudFormation template. Alternatively, to deploy using the AWS CLI, first upload the template to an S3 bucket (the template exceeds the 51,200-byte limit for an inline --template-body), then create the stack.
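A CLI deployment can be sketched as follows. This assumes a configured AWS account; the stack name and the template staging bucket are placeholder values you should replace with your own.

```shell
# Placeholder names -- substitute your own stack name and an existing bucket you own.
STACK_NAME=aurora-mysql-to-s3tables
TEMPLATE_BUCKET=my-cfn-template-staging

# Upload the template first; it is too large to pass inline via --template-body.
aws s3 cp scripts/aurora-mysql-to-s3tables-stack.yaml "s3://${TEMPLATE_BUCKET}/"

# Create the stack, passing the one required parameter.
aws cloudformation create-stack \
  --stack-name "${STACK_NAME}" \
  --template-url "https://${TEMPLATE_BUCKET}.s3.amazonaws.com/aurora-mysql-to-s3tables-stack.yaml" \
  --parameters ParameterKey=S3TableBucketName,ParameterValue=my-tickit-table-bucket \
  --capabilities CAPABILITY_NAMED_IAM

# Block until the deployment (roughly 30-45 minutes) finishes.
aws cloudformation wait stack-create-complete --stack-name "${STACK_NAME}"
```

The IAM capability flag is needed because the stack creates roles for Lambda and AWS Glue.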
The stack will automatically:
- Create the S3 Tables bucket (or use an existing one if it already exists)
- Create a VPC with private subnets and VPC endpoints
- Provision an Aurora MySQL Serverless v2 cluster
- Download the TICKIT sample data from the public Amazon S3 bucket
- Load the sample data into Aurora MySQL via a Lambda function using PyMySQL
- Create a Glue job configured to migrate data to S3 Tables in Iceberg format
Note: The S3 Tables bucket is retained when the stack is deleted to preserve your data.
Step 2: Verify the Aurora MySQL data
Retrieve the AuroraClusterEndpoint, DatabaseName, and SecretArn values from the CloudFormation stack outputs and make a note of them. You can navigate to the Amazon Aurora console, choose the query editor, and enter these values to connect. You can also use your preferred method of connecting to an Amazon Aurora DB cluster.
Use the AWS CLI to retrieve the stack outputs:
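For example (assuming the stack name you chose at deployment):

```shell
STACK_NAME=aurora-mysql-to-s3tables   # placeholder stack name

# Print only the three outputs needed to connect to the database.
aws cloudformation describe-stacks \
  --stack-name "${STACK_NAME}" \
  --query "Stacks[0].Outputs[?OutputKey=='AuroraClusterEndpoint' || OutputKey=='DatabaseName' || OutputKey=='SecretArn']" \
  --output table
```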
Then run the following SQL commands to verify the data load:
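A sketch of one way to run the checks from a shell, assuming a network path to the private Aurora endpoint (for example, from a bastion host in the VPC) and the mysql client installed. The ENDPOINT and SECRET_ARN placeholders come from the stack outputs:

```shell
ENDPOINT=<AuroraClusterEndpoint>   # from the stack outputs
SECRET_ARN=<SecretArn>             # from the stack outputs

# Pull the database password out of the Secrets Manager secret.
PASSWORD=$(aws secretsmanager get-secret-value --secret-id "${SECRET_ARN}" \
  --query SecretString --output text \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["password"])')

# List the tables and spot-check row counts.
mysql -h "${ENDPOINT}" -u admin -p"${PASSWORD}" tickit -e "
  SHOW TABLES;
  SELECT COUNT(*) AS sales_rows FROM sales;
  SELECT COUNT(*) AS event_rows FROM event;"
```

You should see all seven TICKIT tables listed with nonzero row counts.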
Step 3: Run the Glue job
Navigate to the AWS Glue console and choose ETL jobs under Data Integration and ETL in the left panel. Select the AWS Glue job mysql-tickit-to-iceberg-job and choose Run job to start execution. You can also start the ETL job using the AWS CLI:
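For example, capturing the run ID so you can check on progress (rerun the second command until the state reaches SUCCEEDED):

```shell
# Start the job and capture the run ID.
RUN_ID=$(aws glue start-job-run \
  --job-name mysql-tickit-to-iceberg-job \
  --query JobRunId --output text)

# Check the run state (RUNNING, SUCCEEDED, FAILED, ...).
aws glue get-job-run \
  --job-name mysql-tickit-to-iceberg-job \
  --run-id "${RUN_ID}" \
  --query JobRun.JobRunState --output text
```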
The AWS Glue job performs the following operations for each of the seven TICKIT tables:
- Reads the table from Aurora MySQL through the native MySQL Glue connection
- Converts the data to a Spark DataFrame
- Creates the Iceberg table in the S3 Tables namespace using CREATE TABLE IF NOT EXISTS with the USING ICEBERG clause
- Inserts the data using INSERT INTO (or INSERT OVERWRITE if the table already exists)
- Verifies the record count and displays sample data
Step 4: Verify the results
After the AWS Glue job completes, verify that the tables have been created in your S3 table bucket by navigating to the Amazon S3 console. Choose Table buckets under Buckets and select your table bucket. You can also verify using the AWS CLI:
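For example, using the s3tables commands. The table bucket ARN below is a placeholder built from the S3TableBucketName parameter; substitute your own Region and account ID:

```shell
# Placeholder ARN -- replace Region, account ID, and bucket name with your own.
TABLE_BUCKET_ARN=arn:aws:s3tables:us-east-1:111122223333:bucket/my-tickit-table-bucket

# List the namespaces, then the tables migrated into the tickit namespace.
aws s3tables list-namespaces --table-bucket-arn "${TABLE_BUCKET_ARN}"
aws s3tables list-tables --table-bucket-arn "${TABLE_BUCKET_ARN}" --namespace tickit
```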
Select a table from the tickit namespace and choose Preview to inspect the data.

You can also query the migrated tables using Amazon Athena to validate the data.
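As a sketch, assuming the S3 Tables integration with AWS analytics services is enabled (which exposes the table bucket to Athena as a catalog named s3tablescatalog/<bucket-name>), a count query might look like this; the bucket name, workgroup, and results location are placeholders:

```shell
# Submit a validation query against the migrated sales table.
aws athena start-query-execution \
  --query-string 'SELECT COUNT(*) FROM "s3tablescatalog/my-tickit-table-bucket"."tickit"."sales"' \
  --work-group primary \
  --result-configuration OutputLocation=s3://my-athena-results-bucket/
```

Compare the count against the source row count from Step 2.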
Clean up resources
Remember to clean up resources when you no longer need them to avoid unnecessary charges.
Navigate to the CloudFormation console, search for your stack, and choose Delete. Alternatively, use the AWS CLI:
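For example (assuming the placeholder stack name used earlier):

```shell
aws cloudformation delete-stack --stack-name aurora-mysql-to-s3tables
aws cloudformation wait stack-delete-complete --stack-name aurora-mysql-to-s3tables
```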
The S3 Tables bucket is retained by default. To delete it, use the Amazon S3 console or the AWS CLI to remove the table bucket separately. The staging S3 bucket will be automatically emptied and deleted as part of the stack deletion.
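Deleting the retained table bucket via the CLI means removing its tables, then the namespace, then the bucket itself. A sketch, with the bucket ARN again a placeholder:

```shell
TABLE_BUCKET_ARN=arn:aws:s3tables:us-east-1:111122223333:bucket/my-tickit-table-bucket

# Delete each migrated table, then the namespace, then the bucket.
for t in users venue category date event listing sales; do
  aws s3tables delete-table --table-bucket-arn "${TABLE_BUCKET_ARN}" \
    --namespace tickit --name "$t"
done
aws s3tables delete-namespace --table-bucket-arn "${TABLE_BUCKET_ARN}" --namespace tickit
aws s3tables delete-table-bucket --table-bucket-arn "${TABLE_BUCKET_ARN}"
```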
Summary
In this post, we showed you how to extract data from Amazon Aurora MySQL Serverless v2 and write it to Amazon S3 Tables in Apache Iceberg format using AWS Glue 5.0. By using the native Iceberg support of AWS Glue and the S3 Tables REST catalog endpoint, you can bridge the gap between relational databases and modern lakehouse storage formats. By automating the entire pipeline through CloudFormation, you can quickly set up and replicate this pattern across multiple environments.
As AWS Glue and Amazon S3 Tables continue to evolve, you can take advantage of future enhancements while maintaining this automated migration pattern.
If you have questions or suggestions, leave us a comment.
