Organizations are increasingly using a multi-cloud strategy to run their production workloads. We often see requests from customers who started their data journey by building data lakes on Microsoft Azure and now want to extend access to that data to AWS services. Customers want to use a variety of AWS analytics, data, AI, and machine learning (ML) services like AWS Glue, Amazon Redshift, and Amazon SageMaker to build more cost-efficient, performant data solutions that harness the strengths of individual cloud service providers for their business use cases.
In such scenarios, data engineers face challenges in connecting to and extracting data from storage containers on Microsoft Azure. Customers typically use Azure Data Lake Storage Gen2 (ADLS Gen2) as their data lake storage medium, store the data in open table formats like Delta tables, and want to use AWS analytics services like AWS Glue to read the Delta tables. AWS Glue, with its ability to process data using Apache Spark and connect to various data sources, is a suitable solution for addressing the challenges of accessing data across multiple cloud environments.
AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, ML, and application development. AWS Glue custom connectors allow you to discover and integrate additional data sources, such as software as a service (SaaS) applications and your custom data sources. With just a few clicks, you can search for and subscribe to connectors from AWS Marketplace and begin your data preparation workflow in minutes.
In this post, we explain how you can extract data from ADLS Gen2 using the Azure Data Lake Storage Connector for AWS Glue. We specifically demonstrate how to import data stored in Delta tables in ADLS Gen2. We provide step-by-step guidance on how to configure the connector, author an AWS Glue ETL (extract, transform, and load) script, and load the extracted data into Amazon Simple Storage Service (Amazon S3).
Azure Data Lake Storage Connector for AWS Glue
The Azure Data Lake Storage Connector for AWS Glue simplifies the process of connecting AWS Glue jobs to extract data from ADLS Gen2. It uses Hadoop’s FileSystem interface and the ADLS Gen2 connector for Hadoop. The connector also includes the hadoop-azure module, which lets you run Apache Hadoop or Apache Spark jobs directly with data in ADLS. When the connector is added to the AWS Glue environment, AWS Glue loads the library from the Amazon Elastic Container Registry (Amazon ECR) repository during initialization (as a connector). When AWS Glue has internet access, the Spark job in AWS Glue can read from and write to ADLS.
Because the Azure Data Lake Storage Connector for AWS Glue is available in AWS Marketplace, an AWS Glue connection based on it makes sure the required packages are available to your AWS Glue job.
For this post, we use the Shared Key authentication method.
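With Shared Key authentication, the hadoop-azure module reads the storage account key from a per-account configuration property, and data is addressed with an abfss:// URI. The following minimal sketch shows how those two strings are typically assembled. The helper names and the sample account, container, and path values are hypothetical and only serve as illustration.

```python
# Hypothetical helpers: build the strings a Spark job needs for
# Shared Key access to ADLS Gen2 through hadoop-azure (ABFS driver).

ADLS_ENDPOINT_SUFFIX = "dfs.core.windows.net"  # standard ADLS Gen2 endpoint suffix


def shared_key_conf_name(account_name: str) -> str:
    """Return the hadoop-azure property name that holds the account's shared key."""
    return f"fs.azure.account.key.{account_name}.{ADLS_ENDPOINT_SUFFIX}"


def abfss_uri(container: str, account_name: str, path: str = "") -> str:
    """Return an abfss:// URI for a path inside an ADLS Gen2 container."""
    return f"abfss://{container}@{account_name}.{ADLS_ENDPOINT_SUFFIX}/{path.lstrip('/')}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(shared_key_conf_name("mystorageacct"))
    print(abfss_uri("mycontainer", "mystorageacct", "/delta/sample_delta_table/"))
```

In a Spark job, the property name is set on the Spark configuration with the account key as its value, and the URI is passed to the reader.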
Solution overview
In this post, our objective is to migrate a product table named sample_delta_table, which currently resides in ADLS Gen2, to Amazon S3. To accomplish this, we use AWS Glue, the Azure Data Lake Storage Connector for AWS Glue, and AWS Secrets Manager to securely store the Azure shared key. We use an AWS Glue serverless ETL job, configured with the connector, to establish a connection to ADLS using shared key authentication over the public internet. After the table is migrated to Amazon S3, we use Amazon Athena to query the Delta Lake table.
The following architecture diagram illustrates how AWS Glue facilitates data ingestion from ADLS.

Prerequisites
You need the following prerequisites:
Configure your ADLS Gen2 account in Secrets Manager
Complete the following steps to create a secret in Secrets Manager to store the ADLS credentials:
- On the Secrets Manager console, choose Store a new secret.
- For Secret type, select Other type of secret.
- Enter the key accountName for the ADLS Gen2 storage account name.
- Enter the key accountKey for the ADLS Gen2 storage account key.
- Enter the key container for the ADLS Gen2 container.
- Leave the rest of the options as default and choose Next.

- Enter a name for the secret (for example, adlstorage_credentials).
- Choose Next.
- Complete the rest of the steps to store the secret.
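The secret created in the preceding steps is a JSON document with the keys accountName, accountKey, and container. As a sketch of how a job might read it back, the following code parses that document and fails early if a key is missing. The function and class names are hypothetical; the client is assumed to be a Secrets Manager client such as the one returned by boto3.client("secretsmanager").

```python
import json
from typing import NamedTuple


class AdlsCredentials(NamedTuple):
    account_name: str
    account_key: str
    container: str


def parse_adls_secret(secret_string: str) -> AdlsCredentials:
    """Parse the JSON key/value secret, failing early on a missing key."""
    data = json.loads(secret_string)
    missing = [k for k in ("accountName", "accountKey", "container") if k not in data]
    if missing:
        raise KeyError(f"secret is missing keys: {missing}")
    return AdlsCredentials(data["accountName"], data["accountKey"], data["container"])


def load_adls_credentials(secrets_client, secret_id: str = "adlstorage_credentials") -> AdlsCredentials:
    """Fetch the secret through a Secrets Manager client and parse it."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return parse_adls_secret(response["SecretString"])
```

Passing the client in as an argument keeps the parsing logic testable without AWS access. The IAM role of the AWS Glue job needs permission to read the secret.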
Subscribe to the Azure Data Lake Storage Connector for AWS Glue
The Azure Data Lake Storage Connector for AWS Glue simplifies the process of connecting AWS Glue jobs to extract data from ADLS Gen2. The connector is available as an AWS Marketplace offering.
Complete the following steps to subscribe to the connector:
- Log in to your AWS account with the necessary permissions.
- Navigate to the AWS Marketplace page for the Azure Data Lake Storage Connector for AWS Glue.
- Choose Continue to Subscribe.
- Choose Continue to Configuration after reading the EULA.

- For Fulfillment option, choose Glue 4.0.
- For Software version, choose the latest software version.
- Choose Continue to Launch.

Create a custom connection in AWS Glue
After you’re subscribed to the connector, complete the following steps to create an AWS Glue connection based on it. This connection will be added to the AWS Glue job to make sure the connector is available and the data store connection information is accessible to establish a network pathway.
To create the AWS Glue connection, you need to activate the Azure Data Lake Storage Connector for AWS Glue on the AWS Glue Studio console. After you choose Continue to Launch in the previous steps, you’re redirected to the connector landing page.
- In the Configuration details section, choose Usage instructions.
- Choose Activate the Glue connector from AWS Glue Studio.

The AWS Glue Studio console gives you the option to either activate the connector or activate it and create the connection in one step. For this post, we choose the second option.
- For Connector, confirm Azure ADLS Connector for AWS Glue 4.0 is selected.
- For Name, enter a name for the connection (for example, AzureADLSStorageGen2Connection).
- Enter an optional description.
- Choose Create connection and activate connector.

The connection is now ready for use. The connector and connection information is visible on the Data connections page of the AWS Glue console.

Read Delta tables from ADLS Gen2 using the connector in an AWS Glue ETL job
Complete the following steps to create an AWS Glue job and configure the AWS Glue connection and job parameter options:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- Choose Author code with a script editor and choose Script editor.
- Choose Create script and go to the Job details section.
- Update the settings for Name and IAM role.
- Under Advanced properties, add the AWS Glue connection AzureADLSStorageGen2Connection created in previous steps.

- For Job parameters, add the key --datalake-formats with the value delta.
- Use the following script to read the Delta table from ADLS. Provide the path to where you have Delta table files in your Azure storage account container and the S3 bucket for writing Delta files to the output S3 location.
- Choose Run to start the job.
- On the Runs tab, confirm the job ran successfully.

- On the Amazon S3 console, verify the Delta files in the S3 bucket (Delta table path).

- Create a database and table in Athena to query the migrated Delta table in Amazon S3.
You can accomplish this step using an AWS Glue crawler. The crawler can automatically crawl your Delta table stored in Amazon S3 and create the necessary metadata in the AWS Glue Data Catalog. Athena can then use this metadata to query and analyze the Delta table. For more information, see Crawl Delta Lake tables using AWS Glue crawlers.
- Query the Delta table in Athena.
By following the steps outlined in this post, you have successfully migrated a Delta table from ADLS Gen2 to Amazon S3 using an AWS Glue ETL job.
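The core of the ETL script used in the job can be sketched as follows. This is an illustrative outline under stated assumptions, not the exact script from the walkthrough: the function name copy_delta_table is hypothetical, the spark argument is assumed to be the SparkSession available in the AWS Glue job, and the overwrite write mode is a choice made for this sketch.

```python
# Sketch of the job's core logic. The Delta format is assumed to be available
# because the job sets --datalake-formats to delta, and the abfss:// scheme
# because the ADLS connector connection is attached to the job.


def copy_delta_table(spark, account_name, account_key, container, source_dir, target_s3_path):
    """Read a Delta table from ADLS Gen2 and write it to Amazon S3 as Delta."""
    endpoint = f"{account_name}.dfs.core.windows.net"

    # Shared Key authentication: hand the storage account key to hadoop-azure.
    spark.conf.set(f"fs.azure.account.key.{endpoint}", account_key)

    # Read the Delta table from the ADLS Gen2 container.
    source_path = f"abfss://{container}@{endpoint}/{source_dir.strip('/')}/"
    df = spark.read.format("delta").load(source_path)

    # "overwrite" is an assumption that keeps reruns idempotent; choose the mode you need.
    df.write.format("delta").mode("overwrite").save(target_s3_path)
    return source_path


# In a Glue job you might call it like this (placeholders, not real values):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   copy_delta_table(spark, "<accountName>", "<accountKey>", "<container>",
#                    "<path-to-delta-table-files>", "s3://<your-bucket>/<delta-table-path>/")
```

The account name, account key, and container would come from the adlstorage_credentials secret rather than being hard-coded in the script.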
Read the Delta table in an AWS Glue notebook
The following are optional steps if you want to read the Delta table from ADLS Gen2 in an AWS Glue notebook:
- Create a notebook and run the following code in the first notebook cell to configure the AWS Glue connection and --datalake-formats in an interactive session:
- Run the following code in a new cell to read the Delta table stored in ADLS Gen2. Provide the path to where you have Delta files in an Azure storage account container and the S3 bucket for writing Delta files to Amazon S3.
Clean up
To clean up your resources, complete the following steps:
- Remove the AWS Glue job, database, table, and connection:
  - On the AWS Glue console, choose Tables in the navigation pane, select sample_delta_table, and choose Delete.
  - Choose Databases in the navigation pane, select deltadb, and choose Delete.
  - Choose Connections in the navigation pane, select AzureADLSStorageGen2Connection, and on the Actions menu, choose Delete.
- On the Secrets Manager console, choose Secrets in the navigation pane, select adlstorage_credentials, and on the Actions menu, choose Delete secret.
- If you are not going to use this connector, you can cancel the subscription to the connector:
  - On the AWS Marketplace console, choose Manage subscriptions.
  - Select the subscription for the product that you want to cancel, and on the Actions menu, choose Cancel subscription.
  - Read the information provided and select the acknowledgement check box.
  - Choose Yes, cancel subscription.
- On the Amazon S3 console, delete the data in the S3 bucket that you used in the previous steps.
You can also use the AWS Command Line Interface (AWS CLI) to remove the AWS Glue and Secrets Manager resources. Remove the AWS Glue job, database, table, connection, and Secrets Manager secret with the following command:
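The same cleanup can also be scripted with the AWS SDK for Python (boto3). The following is a sketch of that alternative, not the AWS CLI command itself. The function name, the job name passed by the caller, and the 7-day recovery window are assumptions made for illustration; the database, table, connection, and secret names are the ones used in this post.

```python
def delete_migration_resources(glue, secrets, job_name,
                               database="deltadb",
                               table="sample_delta_table",
                               connection="AzureADLSStorageGen2Connection",
                               secret_id="adlstorage_credentials"):
    """Delete the Glue job, table, database, connection, and the stored secret."""
    glue.delete_job(JobName=job_name)
    glue.delete_table(DatabaseName=database, Name=table)
    glue.delete_database(Name=database)
    glue.delete_connection(ConnectionName=connection)
    # Secrets Manager schedules the deletion; the 7-day window is an assumption.
    secrets.delete_secret(SecretId=secret_id, RecoveryWindowInDays=7)


# Example call (requires boto3 and AWS credentials; the job name is a placeholder):
#   import boto3
#   delete_migration_resources(boto3.client("glue"),
#                              boto3.client("secretsmanager"),
#                              job_name="<your-glue-job-name>")
```

The table is deleted before the database so that each call targets a resource that still exists.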
Conclusion
In this post, we demonstrated a real-world example of migrating a Delta table from Azure Data Lake Storage Gen2 to Amazon S3 using AWS Glue. We used an AWS Glue serverless ETL job, configured with an AWS Marketplace connector, to establish a connection to ADLS using shared key authentication over the public internet. Additionally, we used Secrets Manager to securely store the shared key and integrate it within the AWS Glue ETL job, providing a secure and efficient migration process. Finally, we provided guidance on querying the Delta Lake table from Athena.
Try out the solution for your own use case, and let us know your feedback and questions in the comments.
About the Authors
Nitin Kumar is a Cloud Engineer (ETL) at Amazon Web Services, specialized in AWS Glue. With a decade of experience, he excels in aiding customers with their big data workloads, focusing on data processing and analytics. He is committed to helping customers overcome ETL challenges and develop scalable data processing and analytics pipelines on AWS. In his free time, he likes to watch movies and spend time with his family.
Shubham Purwar is a Cloud Engineer (ETL) at AWS Bengaluru, specialized in AWS Glue and Amazon Athena. He is passionate about helping customers solve issues related to their ETL workload and implement scalable data processing and analytics pipelines on AWS. In his free time, Shubham likes to spend time with his family and travel around the world.
Pramod Kumar P is a Solutions Architect at Amazon Web Services. With 19 years of technology experience and close to a decade of designing and architecting connectivity solutions (IoT) on AWS, he guides customers to build solutions with the right architectural tenets to meet their business outcomes.
Madhavi Watve is a Senior Solutions Architect at Amazon Web Services, providing help and guidance to a broad range of customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. She brings over 20 years of technology experience in software development and architecture and is a data analytics specialist.
Swathi S is a Technical Account Manager with the Enterprise Support team in Amazon Web Services. She has over 6 years of experience with AWS on big data technologies and specializes in analytics frameworks. She is passionate about helping AWS customers navigate the cloud space and enjoys assisting with design and optimization of analytics workloads on AWS.