
Introducing AWS Glue 5.1 for Apache Spark


AWS Glue is a serverless, scalable data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources. AWS recently announced Glue 5.1, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.1 upgrades the Spark engines to Apache Spark 3.5.6, giving you a newer Spark release along with newer dependent libraries so you can develop, run, and scale your data integration workloads and get insights faster.

In this post, we describe what’s new in AWS Glue 5.1, key highlights on Spark and related libraries, and how to get started on AWS Glue 5.1.

What’s new in AWS Glue 5.1

The following updates are in AWS Glue 5.1:

Runtime and library upgrades

AWS Glue 5.1 upgrades the runtime to Spark 3.5.6, Python 3.11, and Scala 2.12.18 with new improvements from the open source version. AWS Glue 5.1 also updates support for open table format libraries to Apache Hudi 1.0.2, Apache Iceberg 1.10.0, and Delta Lake 3.3.2 so you can solve advanced use cases around performance, cost, governance, and privacy in your data lakes.

Support for new Apache Iceberg features

AWS Glue 5.1 adds support for Apache Iceberg materialized views and Apache Iceberg format version 3.0. AWS Glue 5.1 also adds support for data writes into Iceberg and Hive tables with Spark-native fine-grained access control with AWS Lake Formation.

Apache Iceberg materialized views are especially useful when you need to accelerate frequently run queries on large datasets by pre-computing expensive aggregations. To learn more about Apache Iceberg materialized views, refer to Introducing Apache Iceberg materialized views in AWS Glue Data Catalog.

Apache Iceberg format version 3.0 is the latest Iceberg format version defined in the Iceberg Table Spec. The following features are supported:

Create an Iceberg V3 format table

To create an Iceberg V3 format table, set format-version to 3 when creating the table. The following is a sample PySpark script (replace amzn-s3-demo-bucket with your S3 bucket name):

from pyspark.sql import SparkSession

s3bucket = "amzn-s3-demo-bucket" 
database = "glue51_blog_demo" 
table_name = "iceberg_v3_table_demo"

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.defaultCatalog", "glue_catalog")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.type", "glue")
    .config("spark.sql.catalog.glue_catalog.warehouse", f"s3://{s3bucket}/{database}/{table_name}/")
    .getOrCreate()
)

spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")

# Create Iceberg table with V3 format-version
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {database}.{table_name} (
        id int,
        name string,
        age int,
        created_at timestamp
    ) USING iceberg
    TBLPROPERTIES (
        'format-version'='3',
        'write.delete.mode'='merge-on-read'
    )
""")

To migrate from V2 format to V3, use ALTER TABLE ... SET TBLPROPERTIES to update the format-version. The following is a sample PySpark script:

spark.sql(f"ALTER TABLE {database}.{table_name} SET TBLPROPERTIES ('format-version'='3')")

You cannot roll back from V3 to V2, so verify that all your Iceberg clients support Iceberg V3 format version before you upgrade. Once a table is upgraded, older versions cannot correctly read the newer format version, because Iceberg table format versions are not forward-compatible.
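Because the upgrade is one-way, it can help to guard the ALTER TABLE statement behind an explicit check. The following is a minimal sketch, not part of the original script: the helper names `current_format_version` and `upgrade_to_v3` and the `clients_verified` flag are hypothetical, and it assumes that `SHOW TBLPROPERTIES` reports `format-version` for the Iceberg table.

```python
def current_format_version(spark, table):
    """Return the format-version reported by SHOW TBLPROPERTIES, or None if absent."""
    rows = spark.sql(f"SHOW TBLPROPERTIES {table}").collect()
    for row in rows:
        if row["key"] == "format-version":
            return int(row["value"])
    return None


def upgrade_to_v3(spark, table, clients_verified=False):
    """Upgrade a table to V3 only after the caller confirms client support.

    Returns True if ALTER TABLE was issued, False if the table is already V3 or later.
    """
    if not clients_verified:
        raise ValueError(
            "Refusing to upgrade: confirm that every Iceberg client supports V3 first."
        )
    version = current_format_version(spark, table)
    if version is not None and version >= 3:
        return False  # already upgraded; nothing to do
    spark.sql(f"ALTER TABLE {table} SET TBLPROPERTIES ('format-version'='3')")
    return True
```

With the variables from the sample script, a call would look like `upgrade_to_v3(spark, f"{database}.{table_name}", clients_verified=True)`.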

Create a table with Row Lineage tracking enabled

To create a table with Row Lineage tracking enabled, set the table property row-lineage to true. The following is a sample PySpark script:

# Create Iceberg table with row-lineage-tracking
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {database}.{table_name} (
        id int,
        name string,
        age int,
        created_at timestamp
    ) USING iceberg
    TBLPROPERTIES (
        'format-version'='3',
        'row-lineage'='true',
        'write.delete.mode'='merge-on-read'
    )
""")

In tables with Row Lineage tracking enabled, row IDs are managed at the metadata level for tracking row changes over time and for auditing.

Extended support for AWS Lake Formation permissions

Fine-grained access control with Lake Formation has been supported through native Spark DataFrames and Spark SQL in Glue 5.0 for read operations. Glue 5.1 extends fine-grained access control to write operations.

Full-Table Access (FTA) control in Apache Spark was introduced for Apache Hive and Iceberg tables in Glue 5.0. Glue 5.1 extends FTA support to Apache Hudi tables and Delta Lake tables.

S3A by default

AWS Glue 5.1 uses S3A as the default S3 connector. This change aligns with the recent Amazon EMR adoption of S3A as the default connector and brings enhanced performance and advanced features to Glue workloads. For more details about the S3A connector’s capabilities and optimizations, see Optimize Amazon EMR runtime for Apache Spark with EMR S3A.

Note that when migrating from Glue 5.0 to Glue 5.1, if neither spark.hadoop.fs.s3a.endpoint nor spark.hadoop.fs.s3a.endpoint.region is set, the default Region used by S3A is us-east-2. This can cause issues. To mitigate the issues caused by this change, set the spark.hadoop.fs.s3a.endpoint.region Spark configuration when using the S3A file system in AWS Glue 5.1.
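The following is a minimal sketch of setting this property when a job script builds its Spark session. The helper names `s3a_region_conf` and `build_session` and the example Region `us-west-2` are placeholders rather than anything prescribed by AWS Glue; use the Region where your buckets are located.

```python
def s3a_region_conf(region):
    """Return the Spark setting that pins the S3A endpoint Region."""
    if not region:
        raise ValueError("region must be a non-empty AWS Region name, e.g. 'us-west-2'")
    return {"spark.hadoop.fs.s3a.endpoint.region": region}


def build_session(region):
    """Create a SparkSession with the S3A Region set explicitly."""
    # Imported lazily so s3a_region_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in s3a_region_conf(region).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Applying the setting on the builder, before the session exists, is the safer choice here, since Hadoop file system settings are generally picked up when the Spark context starts.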

Dependent library upgrades

AWS Glue 5.1 upgrades the runtime to Spark 3.5.6, Python 3.11, and Scala 2.12.18 with upgraded dependent libraries.

The following table lists dependency upgrades:

Dependency | Version in AWS Glue 5.0 | Version in AWS Glue 5.1
Spark | 3.5.4 | 3.5.6
Hadoop | 3.4.1 | 3.4.1
Scala | 2.12.18 | 2.12.18
Hive | 2.3.9 | 2.3.9
EMRFS | 2.69.0 | 2.73.0
Arrow | 12.0.1 | 12.0.1
Iceberg | 1.7.1 | 1.10.0
Hudi | 0.15.0 | 1.0.2
Delta Lake | 3.3.0 | 3.3.2
Java | 17 | 17
Python | 3.11 | 3.11.14
boto3 | 1.34.131 | 1.40.61
AWS SDK for Java | 2.29.52 | 2.35.5
AWS Glue Data Catalog Client | 4.5.0 | 4.9.0
EMR DynamoDB Connector | 5.6.0 | 5.7.0
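To see at a glance which entries actually changed between the two runtimes, the rows of the dependency table can be compared programmatically. This is a small illustrative sketch; the names `DEPENDENCIES` and `changed_dependencies` are made up for the example, and the values are copied from the table.

```python
# (Glue 5.0 version, Glue 5.1 version) pairs copied from the dependency table.
DEPENDENCIES = {
    "Spark": ("3.5.4", "3.5.6"),
    "Hadoop": ("3.4.1", "3.4.1"),
    "Scala": ("2.12.18", "2.12.18"),
    "Hive": ("2.3.9", "2.3.9"),
    "EMRFS": ("2.69.0", "2.73.0"),
    "Arrow": ("12.0.1", "12.0.1"),
    "Iceberg": ("1.7.1", "1.10.0"),
    "Hudi": ("0.15.0", "1.0.2"),
    "Delta Lake": ("3.3.0", "3.3.2"),
    "Java": ("17", "17"),
    "Python": ("3.11", "3.11.14"),
    "boto3": ("1.34.131", "1.40.61"),
    "AWS SDK for Java": ("2.29.52", "2.35.5"),
    "AWS Glue Data Catalog Client": ("4.5.0", "4.9.0"),
    "EMR DynamoDB Connector": ("5.6.0", "5.7.0"),
}


def changed_dependencies(deps):
    """Return the names whose listed version differs between Glue 5.0 and Glue 5.1."""
    return [name for name, (old, new) in deps.items() if old != new]


for name in changed_dependencies(DEPENDENCIES):
    old, new = DEPENDENCIES[name]
    print(f"{name}: {old} -> {new}")
```

The comparison is a plain string check, so an entry such as Python (3.11 versus 3.11.14) is reported as changed even though the minor version is the same.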

The following are database connector (JDBC driver) upgrades:

Driver | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 5.1
MySQL | 8.0.33 | 8.0.33
Microsoft SQL Server | 10.2.0 | 10.2.0
Oracle Databases | 23.3.0.23.09 | 23.3.0.23.09
PostgreSQL | 42.7.3 | 42.7.3
Amazon Redshift | redshift-jdbc42-2.1.0.29 | redshift-jdbc42-2.1.0.29

The following are Spark connector upgrades:

Driver | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 5.1
Amazon Redshift | 6.4.0 | 6.4.2
OpenSearch | 1.2.0 | 1.2.0
MongoDB | 10.3.0 | 10.3.0
Snowflake | 3.0.0 | 3.1.1
BigQuery | 0.32.2 | 0.32.2
AzureCosmos | 4.33.0 | 4.33.0
AzureSQL | 1.3.0 | 1.3.0
Vertica | 3.3.5 | 3.3.5

Get started with AWS Glue 5.1

You can start using AWS Glue 5.1 through AWS Glue Studio, the AWS Glue console, the latest AWS SDK, and the AWS Command Line Interface (AWS CLI).

To start using AWS Glue 5.1 jobs in AWS Glue Studio, open the AWS Glue job and on the Job Details tab, choose the version Glue 5.1 – Supports Spark 3.5, Scala 2, Python 3.

To start using AWS Glue 5.1 on an AWS Glue Studio notebook or an interactive session through a Jupyter notebook, set 5.1 in the %glue_version magic:

%glue_version 5.1

The following output shows that the session is set to use AWS Glue 5.1:

Setting Glue version to: 5.1
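For the AWS SDK route, the Glue version is selected with the GlueVersion parameter of the CreateJob API. The following is a minimal sketch using boto3; the function names are hypothetical, and the job name, role ARN, script location, and worker settings in the example call are placeholders to replace with your own.

```python
def glue_51_job_args(name, role_arn, script_location):
    """Build keyword arguments for boto3's glue.create_job targeting Glue 5.1."""
    return {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "5.1",
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "WorkerType": "G.1X",  # placeholder sizing; adjust for your workload
        "NumberOfWorkers": 2,
    }


def create_glue_51_job(name, role_arn, script_location):
    """Create the job. Requires boto3 and AWS credentials with Glue permissions."""
    # Imported lazily so the argument builder above has no dependencies.
    import boto3

    glue = boto3.client("glue")
    return glue.create_job(**glue_51_job_args(name, role_arn, script_location))
```

An example call would be `create_glue_51_job("glue51-demo-job", "arn:aws:iam::123456789012:role/GlueDemoRole", "s3://amzn-s3-demo-bucket/scripts/job.py")`. Keeping the argument builder separate from the API call makes the job definition easy to inspect before anything is created.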

Spark Troubleshooting with Glue 5.1

To accelerate Apache Spark troubleshooting and job performance optimization for your Glue 5.1 ETL jobs, you can use the newly launched Apache Spark troubleshooting agent. Traditional Spark troubleshooting requires extensive manual analysis of logs, performance metrics, and error patterns to identify root causes and optimization opportunities. The agent simplifies this process through natural language prompts, automated workload analysis, and intelligent code recommendations. The agent has three main components: an MCP-compatible AI assistant in your development environment for interaction, the MCP proxy for AWS that handles secure communication between your client and the MCP server, and an Amazon SageMaker Unified Studio managed MCP Server (preview) that provides specialized Spark troubleshooting and upgrade tools for Glue 5.1 jobs.

To set up the agent, follow the instructions to set up the resources and MCP configuration: Setup for Apache Spark Troubleshooting agent. Then, you can launch your preferred MCP client and interact with the troubleshooting tools through conversation.

The following is a demonstration of how you can use the Apache Spark troubleshooting agent with Kiro CLI to debug a Glue 5.1 job run.

For more information and video walkthroughs on how to use the Apache Spark troubleshooting agent, refer to Apache Spark Troubleshooting agent for Amazon EMR.

Conclusion

In this post, we discussed the key features and benefits of AWS Glue 5.1. You can create new AWS Glue jobs on AWS Glue 5.1 or migrate your existing AWS Glue jobs to benefit from the improvements.

We would like to thank the numerous engineers and leaders who helped build Glue 5.1 to support customers with a performance-optimized Spark runtime and deliver new capabilities.


About the authors

Chiho Sugimoto

Chiho is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. She loves planetary science and enjoys studying the asteroid Ryugu on weekends.

Noritaka Sekiyama

Noritaka is a Principal Big Data Architect on the AWS Analytics product team. He’s responsible for designing new features in AWS products, building software artifacts, and providing architecture guidance to customers. In his spare time, he enjoys cycling on his road bike.

Peter Tsai

Peter is a Software Development Engineer at AWS, where he enjoys solving challenges in the design and performance of the AWS Glue runtime. In his leisure time, he enjoys hiking and cycling.

Bo Li

Bo Li is a Senior Software Development Engineer on the AWS Glue team. He’s devoted to designing and building end-to-end solutions to address customers’ data analytics and processing needs with cloud-based, data-intensive, and GenAI technologies.

Kartik Panjabi

Kartik is a Software Development Manager on the AWS Glue team. His team builds generative AI features for data integration and distributed systems for data integration.

Peter Manastyrny

Peter is a Product Manager specializing in data processing and data integration workloads at AWS. He’s working on making AWS Glue the best tool for building and running complex integrated data pipelines.
