
Extract data from Amazon Aurora MySQL to Amazon S3 Tables in Apache Iceberg format


If you handle data in Amazon Aurora MySQL-Compatible Edition and want to make it available for analytics, machine learning (ML), or cross-service querying in a modern lakehouse format, you're not alone.

Organizations often need to run analytics, build ML models, or join data across multiple sources. These are examples of workloads that can be resource-intensive and impractical to run directly against a transactional database. By extracting your Aurora MySQL data into Amazon S3 Tables in Apache Iceberg format, you can offload analytical queries from your production database without impacting its performance, while storing data in a fully managed Iceberg table store optimized for analytics. Built on the open Apache Iceberg standard, Amazon Simple Storage Service (Amazon S3) Tables data is queryable from engines like Amazon Athena, Amazon Redshift Spectrum, and Apache Spark without additional data copies. You can also combine relational data with other datasets already in your data lake, enabling richer cross-domain insights.

Apache Iceberg and Amazon S3 Tables

Apache Iceberg is a widely adopted open table format that offers Atomicity, Consistency, Isolation, Durability (ACID) transactions, schema evolution, and time travel capabilities. It allows multiple engines to work concurrently on the same dataset, making it a popular choice for building open lakehouse architectures.

Amazon S3 Tables is a purpose-built, fully managed Apache Iceberg table store designed for analytics workloads. It delivers up to 3x faster query performance and up to 10x more transactions per second compared to self-managed Iceberg tables. It also automatically compacts data and removes unreferenced files to optimize storage and performance.

In this post, you learn how to set up an automated, end-to-end solution that extracts tables from Amazon Aurora MySQL Serverless v2 and writes them to Amazon S3 Tables in Apache Iceberg format using AWS Glue. The entire infrastructure is deployed using a single AWS CloudFormation stack.

Requirements

AWS offers zero-ETL integrations from Amazon Aurora to Amazon Redshift and Amazon SageMaker AI, enabling seamless data flow for analytics and machine learning workloads.

However, there isn't yet a native zero-ETL integration between Amazon Aurora and Amazon S3 Tables. This means that organizations looking to use Amazon S3 Tables for their lakehouse architecture currently face several requirements:

  • Setting up ETL pipelines to extract data from Amazon Aurora and transform it into Apache Iceberg format
  • Configuring networking and security for AWS Glue jobs to access Amazon Aurora databases in private subnets
  • Coordinating the provisioning of source databases, ETL pipelines, and target table stores
  • Managing the end-to-end workflow without native automation

Solution overview

In this solution, you automate the extraction of relational database tables from Amazon Aurora MySQL Serverless v2 to Amazon S3 Tables in Apache Iceberg format using AWS Glue 5.0. To help you get started and test this solution, a CloudFormation template is provided. This template provisions the required infrastructure, loads sample data, and configures the Extract, Transform, Load (ETL) pipeline. You can adapt this template to your own scenario.

Architecture diagram of the solution

Sample data

This solution uses the TICKIT sample database, a well-known dataset used in the Amazon Redshift documentation. The TICKIT data models a fictional ticket sales system with seven interrelated tables: users, venue, category, date, event, listing, and sales. The dataset is publicly available as mentioned in the Amazon Redshift Getting Started Guide.

Solution flow

The solution flow, as shown in the preceding architecture diagram:

  1. An AWS Lambda function downloads the TICKIT sample dataset (a fictional ticket sales system used in Amazon Redshift documentation) from a public Amazon S3 bucket to a staging S3 bucket.
  2. A second Lambda function, using PyMySQL (a Python MySQL client library), loads the staged data files into the Aurora MySQL Serverless v2 database using LOAD DATA LOCAL INFILE.
  3. The AWS Glue job reads seven TICKIT tables from Aurora MySQL through a native MySQL connection and writes them to Amazon S3 Tables in Apache Iceberg format using the S3 Tables REST catalog endpoint with SigV4 authentication.
  4. You can query the migrated data in S3 Tables using Amazon Athena.
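Step 3 hinges on Spark's Iceberg REST catalog support. The stack's Glue job defines its own configuration, but the Spark properties for pointing an Iceberg catalog at the S3 Tables REST endpoint with SigV4 signing can be sketched as follows (the catalog name `s3tables`, the helper function, and the example ARN are illustrative):

```python
def s3tables_spark_conf(catalog: str, region: str, table_bucket_arn: str) -> dict:
    """Build the Spark properties that point an Iceberg catalog at the
    S3 Tables REST endpoint, signing requests with SigV4."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
        f"{prefix}.uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        f"{prefix}.warehouse": table_bucket_arn,
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "s3tables",
        f"{prefix}.rest.signing-region": region,
    }

conf = s3tables_spark_conf(
    "s3tables", "us-east-1",
    "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The warehouse is the table bucket ARN, so tables land directly in the managed bucket and stay visible to any engine that speaks the Iceberg REST protocol.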

The solution consists of the following key components:

  1. Amazon Aurora MySQL Serverless v2 as the source relational database containing the TICKIT sample dataset (users, venue, category, date, event, listing, and sales tables)
  2. AWS Secrets Manager to store the Aurora MySQL database credentials securely
  3. Amazon S3 staging bucket for the TICKIT sample data files downloaded from the public redshift-downloads S3 bucket
  4. AWS Lambda functions using PyMySQL to load data into Aurora MySQL
  5. AWS Glue 5.0 job (PySpark) to read tables from Aurora MySQL and write them to S3 Tables in Apache Iceberg format
  6. Amazon S3 Tables as the target storage for the migrated Iceberg tables
  7. Amazon VPC with private subnets and VPC endpoints for Amazon S3, S3 Tables, AWS Glue, Secrets Manager, AWS Security Token Service (AWS STS), CloudWatch Logs, and CloudFormation

Here are some advantages of this architecture:

  • Fully automated setup: A single CloudFormation stack provisions the required infrastructure, loads sample data, and configures the ETL pipeline.
  • Serverless and cost-efficient: Aurora MySQL Serverless v2 and AWS Glue both scale based on demand, minimizing idle costs.
  • Apache Iceberg table format: Data is stored in Apache Iceberg format, enabling ACID transactions, schema evolution, and time travel queries.
  • Network isolation and credential management: The resources run inside private subnets with Virtual Private Cloud (VPC) endpoints, and database credentials are managed through AWS Secrets Manager.
  • Extensible pattern: The same approach can be adapted for other relational databases (PostgreSQL, SQL Server) and other target formats supported by AWS Glue.

Prerequisites

To follow along, you need an AWS account. If you don't yet have an AWS account, you must create one. The CloudFormation stack deployment takes roughly 30-45 minutes to complete and requires familiarity with Amazon S3 Tables, AWS CloudFormation, Apache Iceberg, AWS Glue, and Amazon Aurora. This solution will incur AWS costs. The main cost drivers are AWS Glue ETL job runs (billed per DPU-hour, proportional to data volume) and Amazon S3 Tables storage and request charges. Remember to clean up resources when you are finished to avoid unnecessary charges.
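As a rough illustration of the DPU-hour billing model for the Glue cost driver (the $0.44 per DPU-hour rate below is a commonly published rate for AWS Glue ETL in many Regions; check current pricing for your Region, and treat the job size and runtime as placeholders):

```python
def glue_job_cost(dpus: int, runtime_minutes: float,
                  rate_per_dpu_hour: float = 0.44) -> float:
    """Estimate the cost of one Glue ETL job run, billed per DPU-hour."""
    return dpus * (runtime_minutes / 60) * rate_per_dpu_hour

# For example, a 10-DPU job that runs for 15 minutes:
print(f"${glue_job_cost(10, 15):.2f}")  # prints $1.10
```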

CloudFormation parameters

You can configure the following parameters before deploying the CloudFormation stack:

Parameter         | Description                                              | Default     | Required
S3TableBucketName | Name of the S3 Tables bucket to create (or use existing) | (none)      | Yes
DatabaseName      | Name of the initial Aurora MySQL database                | tickit      | No
MasterUsername    | Master username for Aurora MySQL                         | admin       | No
VpcCidr           | CIDR block for the VPC                                   | 10.1.0.0/16 | No
S3TableNamespace  | Namespace for S3 Tables                                  | tickit      | No

Implementation walkthrough

The following steps walk you through the implementation. These steps deploy and test an end-to-end solution from scratch. If you are already running some of these components, you may skip to the relevant step. You can also refer to the aws-samples repository, sample-to-write-aurora-mysql-to-s3tables-using-glue, for the entire solution.

Step 1: Deploy the CloudFormation stack

Deploy the CloudFormation template scripts/aurora-mysql-to-s3tables-stack.yaml using the AWS Console or the AWS Command Line Interface (AWS CLI). Provide a name for the S3 Tables bucket; the stack will create it automatically (or use an existing one if it already exists).

To deploy using the AWS Console (recommended), navigate to the AWS CloudFormation Console and use the CloudFormation template. Alternatively, to deploy using the AWS CLI, first upload the template to an S3 bucket (the template exceeds the 51,200-byte limit for inline --template-body), then create the stack.

# Upload the template to S3
aws s3 cp scripts/aurora-mysql-to-s3tables-stack.yaml \
  s3://<your-s3-bucket>/aurora-mysql-to-s3tables-stack.yaml \
  --region <your-region>

# Create the stack using the S3 template URL
aws cloudformation create-stack \
  --stack-name aurora-mysql-tickit-stack \
  --template-url https://<your-s3-bucket>.s3.<your-region>.amazonaws.com/aurora-mysql-to-s3tables-stack.yaml \
  --parameters \
    ParameterKey=S3TableBucketName,ParameterValue=<your-s3-table-bucket-name> \
  --capabilities CAPABILITY_NAMED_IAM \
  --region <your-region>

The stack will automatically:

  • Create the S3 Tables bucket (or use the existing one if it already exists)
  • Create a VPC with private subnets and VPC endpoints
  • Provision an Aurora MySQL Serverless v2 cluster
  • Download TICKIT sample data from the public Amazon S3 bucket
  • Load the sample data into Aurora MySQL via a Lambda function using PyMySQL
  • Create a Glue job configured to migrate data to S3 Tables in Iceberg format
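The loader Lambda's actual code ships with the stack, but the LOAD DATA LOCAL INFILE statement it issues per staged file can be sketched as follows (the helper name and file path are illustrative; the pipe delimiter matches the TICKIT flat files):

```python
def build_load_statement(local_path: str, table: str, delimiter: str = "|") -> str:
    """Build the LOAD DATA LOCAL INFILE statement for one staged TICKIT file."""
    return (
        f"LOAD DATA LOCAL INFILE '{local_path}' "
        f"INTO TABLE {table} "
        f"FIELDS TERMINATED BY '{delimiter}' "
        f"LINES TERMINATED BY '\\n'"
    )

print(build_load_statement("/tmp/allusers_pipe.txt", "tickit.users"))
```

Note that for this statement to work, the PyMySQL connection must be opened with local_infile=True and the server must allow local infile loads.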

Note: The S3 Tables bucket is retained when the stack is deleted to preserve your data.

Step 2: Verify the Aurora MySQL data

Retrieve the AuroraClusterEndpoint, DatabaseName, and SecretArn values from the CloudFormation stack outputs and make a note of them. You can navigate to the Amazon Aurora Console, choose the Query Editor, and enter the values from the CloudFormation stack to connect. You can also choose your preferred method of connecting to an Amazon Aurora DB cluster.

Use the AWS CLI to retrieve the stack outputs:

aws cloudformation describe-stacks --stack-name aurora-mysql-tickit-stack --region <your-region> --query "Stacks[0].Outputs"
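If you are scripting the verification, the Outputs list returned by describe-stacks is easier to work with as a dict; a small sketch (the sample values are placeholders, not real endpoints):

```python
def stack_outputs_to_dict(outputs: list) -> dict:
    """Flatten CloudFormation's [{'OutputKey': ..., 'OutputValue': ...}] list."""
    return {o["OutputKey"]: o["OutputValue"] for o in outputs}

# Mirrors the shape returned by
# `aws cloudformation describe-stacks --query "Stacks[0].Outputs"`.
sample = [
    {"OutputKey": "AuroraClusterEndpoint",
     "OutputValue": "my-cluster.cluster-example.us-east-1.rds.amazonaws.com"},
    {"OutputKey": "DatabaseName", "OutputValue": "tickit"},
]
print(stack_outputs_to_dict(sample)["DatabaseName"])  # prints tickit
```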

Then run the following SQL commands to verify the data load:

-- Verify that the tables were created
SELECT * FROM information_schema.tables WHERE table_schema = 'tickit';

-- Verify that the data is loaded
SELECT 'users' AS table_name, COUNT(*) AS record_count FROM tickit.users
UNION ALL SELECT 'venue', COUNT(*) FROM tickit.venue
UNION ALL SELECT 'category', COUNT(*) FROM tickit.category
UNION ALL SELECT 'date', COUNT(*) FROM tickit.date
UNION ALL SELECT 'event', COUNT(*) FROM tickit.event
UNION ALL SELECT 'listing', COUNT(*) FROM tickit.listing
UNION ALL SELECT 'sales', COUNT(*) FROM tickit.sales;

Step 3: Run the Glue job

Navigate to the AWS Glue Console and choose ETL jobs under Data Integration and ETL from the left panel. Select the AWS Glue job mysql-tickit-to-iceberg-job and choose Run job to start execution. You can also start the ETL job using the AWS CLI:

aws glue start-job-run --job-name mysql-tickit-to-iceberg-job --region <your-region>

The AWS Glue job performs the following operations for each of the seven TICKIT tables:

  • Reads the table from Aurora MySQL through the native MySQL Glue connection
  • Converts the data to a Spark DataFrame
  • Creates the Iceberg table in the S3 Tables namespace using CREATE TABLE IF NOT EXISTS with the USING ICEBERG clause
  • Inserts the data using INSERT INTO (or INSERT OVERWRITE if the table already exists)
  • Verifies the record count and displays sample data
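The per-table logic boils down to a pair of Spark SQL statements. As a sketch of how they might be generated (the function, the `_src` temp-view naming, and the use of INSERT OVERWRITE for idempotent re-runs are illustrative; the catalog and namespace names follow the defaults above):

```python
def iceberg_statements(catalog: str, namespace: str, table: str) -> tuple:
    """Return the DDL and DML run for one migrated table. The source
    DataFrame is assumed to be registered as a temp view '<table>_src'."""
    target = f"{catalog}.{namespace}.{table}"
    # Create an empty Iceberg table with the source schema if absent.
    ddl = (f"CREATE TABLE IF NOT EXISTS {target} "
           f"USING iceberg AS SELECT * FROM {table}_src LIMIT 0")
    # Overwrite so re-running the job does not duplicate rows.
    dml = f"INSERT OVERWRITE {target} SELECT * FROM {table}_src"
    return ddl, dml

for table in ["users", "venue", "category", "date", "event", "listing", "sales"]:
    ddl, dml = iceberg_statements("s3tables", "tickit", table)
    print(ddl)
    print(dml)
```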

Step 4: Verify the results

After the AWS Glue job completes, verify that the tables have been created in your S3 Table bucket by navigating to the Amazon S3 Console. Choose Table buckets under Buckets and select your S3 Table bucket. You can also verify using the AWS CLI:

aws s3tables list-tables \
  --table-bucket-arn arn:aws:s3tables:<your-region>:<your-account-id>:bucket/<your-s3-table-bucket-name> \
  --namespace tickit \
  --region <your-region>

Select a table from the tickit namespace and choose Preview to inspect the data.

Preview S3 data

You can also query the migrated tables using Amazon Athena to validate the data.

Clean up resources

Remember to clean up resources when you no longer need them to avoid unnecessary charges.

Navigate to the CloudFormation console, search for your stack, and choose Delete. Alternatively, use the AWS CLI:

aws cloudformation delete-stack --stack-name aurora-mysql-tickit-stack --region <your-region>

The S3 Tables bucket is retained by default. To delete it, use the Amazon S3 console or the AWS CLI to remove the table bucket separately. The staging S3 bucket will be automatically emptied and deleted as part of the stack deletion.

aws s3tables delete-table-bucket --table-bucket-arn arn:aws:s3tables:<your-region>:<your-account-id>:bucket/<your-s3-table-bucket-name> --region <your-region>

Summary

In this post, we showed you how to extract data from Amazon Aurora MySQL Serverless v2 and write it to Amazon S3 Tables in Apache Iceberg format using AWS Glue 5.0. By using the native Iceberg support of AWS Glue and the S3 Tables REST catalog endpoint, you can bridge the gap between relational databases and modern lakehouse storage formats. By automating the entire pipeline through CloudFormation, you can quickly set up and replicate this pattern across multiple environments.

As AWS Glue and Amazon S3 Tables continue to evolve, you can take advantage of future enhancements while maintaining this automated migration pattern.

If you have questions or suggestions, leave us a comment.


About the authors

Kunal Ghosh

Kunal is a Sr. Solutions Architect at AWS. He is passionate about building efficient and effective solutions on AWS, especially involving generative AI, analytics, data science, and machine learning. Besides family time, he likes reading, swimming, biking, and watching movies.

Arghya Banerjee

Arghya is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, helping customers adopt and use the AWS Cloud. He is focused on big data, data lakes, streaming and batch analytics services, and generative AI technologies.

Indranil Banerjee

Indranil is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers in the hi-tech and semiconductor sectors solve complex business problems using the AWS Cloud. His special interests are in the areas of legacy modernization and migration, building analytics platforms, and helping customers adopt cutting-edge technologies such as generative AI.

Vipan Kumar

Vipan is a Sr. Solutions Architect at AWS, where he works with strategic customers. He has extensive experience in machine learning and generative AI. With a background in application development, he is passionate about designing and building enterprise applications for the cloud.
