
Use AWS Glue Data Catalog views with Apache Spark in EMR Serverless and AWS Glue 5.0


The AWS Glue Data Catalog has expanded its Data Catalog views feature and now supports Apache Spark environments in addition to Amazon Athena and Amazon Redshift. This enhancement, launched in March 2025, makes it possible to create, share, and query multi-engine SQL views across Amazon EMR Serverless, Amazon EMR on Amazon EKS, and AWS Glue 5.0 Spark, in addition to Athena and Amazon Redshift Spectrum. Multi-dialect views empower data teams to create SQL views once and query them through supported engines, whether it's Athena for ad hoc analytics, Amazon Redshift for data warehousing, or Spark for large-scale data processing. This cross-engine compatibility means data engineers can focus on building data products rather than managing multiple view definitions or complex permission schemes. Using AWS Lake Formation permissions, organizations can share these views within the same AWS account, across different AWS accounts, and with AWS IAM Identity Center users and groups, without granting direct access to the underlying tables. Lake Formation features such as fine-grained access control (FGAC) using Lake Formation tag-based access control (LF-TBAC) can be applied to Data Catalog views, enabling scalable sharing and access control across organizations.

In an earlier blog post, we demonstrated creating Data Catalog views using Athena, adding a SQL dialect for Amazon Redshift, and querying the view using Athena and Amazon Redshift. In this post, we guide you through the process of creating a Data Catalog view using EMR Serverless, adding the Athena SQL dialect to the view, sharing it with another account using LF-Tags, and then querying the view in the recipient account using a separate EMR Serverless workspace, an AWS Glue 5.0 Spark job, and Athena. This demonstration showcases the versatility and cross-account capabilities of Data Catalog views and access through various AWS analytics services.

Benefits of Data Catalog views

The following are key benefits of Data Catalog views for enterprise solutions:

  • Targeted data sharing and access control – Data Catalog views, combined with the sharing capabilities of Lake Formation, enable organizations to provide specific data subsets to different teams or departments without duplicating data. For example, a retail company can create views that provide sales data to regional managers while restricting access to sensitive customer information. By applying LF-TBAC to these views, companies can efficiently manage data access across large, complex organizational structures, maintaining compliance with data governance policies while promoting data-driven decision-making.
  • Multi-service analytics integration – The ability to create a view in one analytics service and query it across Athena, Amazon Redshift, EMR Serverless, and AWS Glue 5.0 Spark breaks down data silos and promotes a unified analytics approach. This feature lets businesses use the strengths of different services for various analytics needs. For instance, a financial institution could create a view of transaction data and use Athena for ad hoc queries, Amazon Redshift for complex aggregations, and EMR Serverless for large-scale data processing, all without moving or duplicating the data. This flexibility accelerates insights and improves resource utilization across the analytics stack.
  • Centralized auditing and compliance – With views stored in the central Data Catalog, businesses can maintain a comprehensive audit trail of data access across connected accounts using AWS CloudTrail logs. This centralization is crucial for industries with strict regulatory requirements, such as healthcare or finance. Compliance officers can seamlessly monitor and report on data access patterns, detect unusual activities, and demonstrate adherence to data protection regulations like GDPR or HIPAA. This centralized approach simplifies compliance processes and reduces the risk of regulatory violations.

These capabilities of Data Catalog views provide powerful options for businesses to enhance data governance, improve analytics efficiency, and maintain strong compliance measures across their data ecosystem.

Solution overview

An example company has several datasets containing details of their customers' purchases mixed with personally identifiable information (PII). They categorize their datasets based on the sensitivity of the information. The data steward wants to share a subset of their preferred customers' data for further analysis downstream by their data engineering team.

To demonstrate this use case, we use sample Apache Iceberg tables customer and customer_address. We create a Data Catalog view from these two tables to filter for preferred customers. We then use LF-Tags to share restricted columns of this view with the downstream engineering team. The solution is represented in the following diagram.

[Architecture diagram]

Prerequisites

To implement this solution, you need two AWS accounts, each with an AWS Identity and Access Management (IAM) admin role. We use the role to run the provided AWS CloudFormation templates, and we also add the same IAM roles as Lake Formation administrators.

Set up infrastructure in the producer account

We provide a CloudFormation template that deploys the following resources and completes the data lake setup:

  • Two Amazon Simple Storage Service (Amazon S3) buckets: one for scripts, logs, and query results, and one for the data lake storage.
  • Lake Formation administrator and catalog settings. The IAM admin role that you provide is registered as Lake Formation administrator. The cross-account sharing version is set to 4. Default permissions for newly created databases and tables are set to use Lake Formation permissions only.
  • An IAM role with read, write, and delete permissions on the data lake bucket objects. The data lake bucket is registered with Lake Formation using this IAM role.
  • An AWS Glue database for the data lake.
  • Lake Formation tags (LF-Tags). These tags are attached to the database.
  • CSV and Iceberg format tables in the AWS Glue database. The CSV tables point to s3://redshift-downloads/TPC-DS/2.13/10GB/ and the Iceberg tables are stored in the account's data lake bucket.
  • An Athena workgroup.
  • An IAM role and an AWS Lambda function to run Athena queries. Athena queries run in the Athena workgroup to insert data from the CSV tables into the Iceberg tables. Relevant Lake Formation permissions are granted to the Lambda role.
  • An EMR Studio and related virtual private cloud (VPC), subnet, route table, security groups, and EMR Studio service IAM role.
  • An IAM role with policies for the EMR Studio runtime. Relevant Lake Formation permissions are granted to this role on the Iceberg tables. This role will be used as the definer role to create the Data Catalog view. A definer role is the IAM role with the necessary permissions to access the referenced tables; it runs the SQL statement that defines the view.

Complete the following steps in your producer AWS account:

  1. Sign in to the AWS Management Console as an IAM administrator role.
  2. Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the stack has finished launching, proceed with the following instructions.

  1. Should you’re utilizing the producer account in Lake Formation for the primary time, on the Lake Formation console, create a database named default and grant describe permission on the default database to runtime function GlueViewBlog-EMRStudio-RuntimeRole.
    data permissions

Create an EMR Serverless application

Complete the following steps to create an EMR Serverless application in your EMR Studio:

  1. On the Amazon EMR console, under EMR Studio in the navigation pane, choose Studios.
  2. Choose GlueViewBlog-emrstudio and choose the Studio's URL link to open it.
  3. On the EMR Studio dashboard, choose Create application.

You'll be directed to the Create application page in EMR Studio. Let's create a Lake Formation enabled EMR Serverless application.

  4. Under Application settings, provide the following information:
    1. For Name, enter a name (for example, emr-glueview-application).
    2. For Type, choose Spark.
    3. For Release version, choose emr-7.8.0.
    4. For Architecture, choose x86_64.
  5. Under Application setup options, select Use custom settings.
  6. Under Interactive endpoint, select Enable endpoint for EMR Studio.
  7. Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
  8. Under Network connections, choose emrs-vpc for VPC, enter any two private subnets, and enter emr-serverless-sg for Security groups.
  9. Choose Create and start application.

Create an EMR Workspace

Complete the following steps to create an EMR Workspace:

  1. On the EMR Studio console, choose Workspaces in the navigation pane and choose Create Workspace.
  2. Enter a Workspace name (for example, emrs-glueviewblog-workspace).
  3. Leave all other settings as default and choose Create Workspace.
  4. Choose Launch Workspace. Your browser might request pop-up permissions the first time you launch the Workspace.
  5. After the Workspace is launched, in the navigation pane, choose Compute.
  6. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-RuntimeRole for Interactive runtime role.
  7. Make sure the kernel attached to the Workspace is PySpark.

Create a Data Catalog view and verify

Complete the following steps:

  1. Download the notebook glueviewblog_producer.ipynb. The code creates a Data Catalog view customer_nonpii_view from the two Iceberg tables, customer_iceberg and customer_address_iceberg, in the database glueviewblog_<account-id>_db.
  2. In your EMR Workspace emrs-glueviewblog-workspace, go to the File browser section and choose Upload files.
  3. Upload glueviewblog_producer.ipynb.
  4. Update the data lake bucket name, AWS account ID, and AWS Region to match your resources.
  5. Update the database_name, table1_name, and table2_name to match your resources.
  6. Save the notebook.
  7. Choose the double arrow icon to restart the kernel and rerun the notebook.

The Data Catalog view customer_nonpii_view is created and verified.
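If you want a sense of the DDL the producer notebook runs, the following is a minimal sketch of creating a multi-dialect Data Catalog view from Spark. The column list mirrors the Athena dialect shown later in this post; the PROTECTED MULTI DIALECT syntax and the example account ID are assumptions here, so verify the exact statement against the downloaded notebook:

```python
# Sketch (not the notebook's exact code): build the Spark DDL that defines
# the Data Catalog view over the two Iceberg tables, filtered to
# preferred customers.

def build_create_view_sql(database: str) -> str:
    """Build the CREATE statement for customer_nonpii_view."""
    return f"""
CREATE PROTECTED MULTI DIALECT VIEW {database}.customer_nonpii_view
AS
SELECT c_customer_id, c_customer_sk, c_last_review_date,
       ca_country, ca_location_type
FROM {database}.customer_iceberg, {database}.customer_address_iceberg
WHERE c_current_addr_sk = ca_address_sk
  AND c_preferred_cust_flag = 'Y'
""".strip()

sql = build_create_view_sql("glueviewblog_123456789012_db")  # placeholder account ID
# In the EMR Studio PySpark notebook you would run it with the Spark session:
# spark.sql(sql)
```

The definer role attached to the Spark session (here, the EMR runtime role) is the identity whose permissions the view uses when it reads the referenced tables.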

  8. In the navigation pane of the Lake Formation console, under Data Catalog, choose Views.
  9. Choose the new view customer_nonpii_view.
  10. On the SQL definitions tab, verify that EMR with Apache Spark appears for Engine name.
  11. Choose the LF-Tags tab. The view should show the LF-Tag sensitivity=pii-confidential inherited from the database.
  12. Choose Edit LF-Tags.
  13. On the Values dropdown menu, choose confidential to overwrite the Data Catalog view's value for the sensitivity key from pii-confidential.
  14. Choose Save.

With this, we have created a non-PII view to share with the data engineering team from datasets that contain customers' PII.
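The LF-Tag overwrite can also be scripted. Here is a minimal boto3 sketch (the account ID and database name are placeholders; in the Lake Formation API, views are addressed as Table resources):

```python
# Sketch: overwrite the sensitivity LF-Tag on the view from
# pii-confidential (inherited from the database) to confidential.

view_resource = {
    "Table": {
        "CatalogId": "111122223333",  # producer account ID (placeholder)
        "DatabaseName": "glueviewblog_111122223333_db",
        "Name": "customer_nonpii_view",
    }
}
new_tags = [{"TagKey": "sensitivity", "TagValues": ["confidential"]}]

# import boto3
# lf = boto3.client("lakeformation")
# lf.add_lf_tags_to_resource(Resource=view_resource, LFTags=new_tags)
```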

Add the Athena SQL dialect to the view

Because the view customer_nonpii_view was created by the EMR runtime role GlueViewBlog-EMRStudio-RuntimeRole, the Admin has only Describe permissions on it as the database creator and Lake Formation administrator. In this step, the Admin grants itself Alter permissions on the view in order to add the Athena SQL dialect.

  1. On the Lake Formation console, in the navigation pane, choose Data permissions.
  2. Choose Grant and provide the following information:
    1. For Principals, enter Admin.
    2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
    3. For Key, choose sensitivity.
    4. For Values, choose confidential and pii-confidential.
    5. Under Database permissions, select Super for Database permissions and Grantable permissions.
    6. Under Table permissions, select Super for Table permissions and Grantable permissions.
    7. Choose Grant.
  3. Verify the LF-Tag based permissions of the Admin.
  4. Open the Athena query editor, choose the workgroup GlueViewBlogWorkgroup, and choose the AWS Glue database glueviewblog_<accountID>_db.
  5. Run the following query. Replace <accountID> with your account ID.
    ALTER VIEW glueviewblog_<accountID>_db.customer_nonpii_view ADD DIALECT
    AS
    SELECT c_customer_id, c_customer_sk, c_last_review_date, ca_country, ca_location_type
    FROM glueviewblog_<accountID>_db.customer_iceberg, glueviewblog_<accountID>_db.customer_address_iceberg
    WHERE c_current_addr_sk = ca_address_sk AND c_preferred_cust_flag = 'Y';

  6. Verify the Athena dialect by running a preview on the view.
  7. On the Lake Formation console, verify the SQL dialects on the view customer_nonpii_view.

Share the view with the consumer account

Complete the following steps to share the Data Catalog view with the consumer account:

  1. On the Lake Formation console, in the navigation pane, choose Data permissions.
  2. Choose Grant and provide the following information:
    1. For Principals, select External accounts and enter the consumer account ID.
    2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
    3. For Key, choose sensitivity.
    4. For Values, choose confidential.
    5. Under Database permissions, select Describe for Database permissions and Grantable permissions.
    6. Under Table permissions, select Describe and Select for Table permissions and Grantable permissions.
    7. Choose Grant.
  3. Verify the granted permissions on the Data permissions page.

With this, the producer account data steward has created a Data Catalog view over a subset of data from two tables in their Data Catalog, using the EMR runtime role as the definer role. They have shared it with their analytics account using LF-Tags so the data can be processed further downstream.
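The cross-account share above can be expressed as Lake Formation GrantPermissions requests against an LF-Tag policy. A sketch (the consumer account ID is a placeholder; the console performs the equivalent of these two calls):

```python
# Sketch: grant DESCRIBE on matching databases and DESCRIBE/SELECT on
# matching tables (including views) to the consumer account, keyed on the
# sensitivity=confidential LF-Tag, with grant option.

tag_expression = [{"TagKey": "sensitivity", "TagValues": ["confidential"]}]

table_grant = {
    "Principal": {"DataLakePrincipalIdentifier": "444455556666"},  # consumer account
    "Resource": {"LFTagPolicy": {"ResourceType": "TABLE", "Expression": tag_expression}},
    "Permissions": ["DESCRIBE", "SELECT"],
    "PermissionsWithGrantOption": ["DESCRIBE", "SELECT"],
}
database_grant = {
    "Principal": {"DataLakePrincipalIdentifier": "444455556666"},
    "Resource": {"LFTagPolicy": {"ResourceType": "DATABASE", "Expression": tag_expression}},
    "Permissions": ["DESCRIBE"],
    "PermissionsWithGrantOption": ["DESCRIBE"],
}

# import boto3
# lf = boto3.client("lakeformation")
# lf.grant_permissions(**database_grant)
# lf.grant_permissions(**table_grant)
```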

Set up infrastructure in the consumer account

We provide a CloudFormation template to deploy the following resources and set up the data lake as follows:

  • An S3 bucket for Amazon EMR and AWS Glue logs
  • Lake Formation administrator and catalog settings similar to the producer account setup
  • An AWS Glue database for the data lake
  • An EMR Studio and related VPC, subnet, route table, security groups, and EMR Studio service IAM role
  • An IAM role with policies for the EMR Studio runtime

Complete the following steps in your consumer AWS account:

  1. Sign in to the console as an IAM administrator role.
  2. Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the stack has finished launching, proceed with the following instructions.

  3. If you're using the consumer account's Lake Formation for the first time, on the Lake Formation console, create a database named default and grant Describe permission on the default database to the runtime role GlueViewBlog-EMRStudio-Client-RuntimeRole.

Accept AWS RAM shares in the consumer account

You can now log in to the consumer AWS account and accept the AWS RAM invitations:

  1. Open the AWS RAM console with an IAM role that has AWS RAM access.
  2. In the navigation pane, choose Resource shares under Shared with me.

You should see two pending resource shares from the producer account.

  3. Accept both invitations.
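If you prefer to accept the shares programmatically, a sketch with the AWS RAM API (the helper is illustrative; field names follow the RAM GetResourceShareInvitations response shape):

```python
# Sketch: find pending RAM invitations and accept them, instead of
# clicking through the console.

def pending_invitation_arns(invitations: list) -> list:
    """Return the ARNs of resource share invitations still awaiting acceptance."""
    return [
        inv["resourceShareInvitationArn"]
        for inv in invitations
        if inv["status"] == "PENDING"
    ]

# import boto3
# ram = boto3.client("ram")
# invs = ram.get_resource_share_invitations()["resourceShareInvitations"]
# for arn in pending_invitation_arns(invs):
#     ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
```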

Create a resource link for the shared view

To access the view that was shared by the producer AWS account, you need to create a resource link in the consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database, table, or view. After you create a resource link to a view, you can use the resource link name wherever you would use the view name. Additionally, you can grant permission on the resource link to the job runtime role GlueViewBlog-EMRStudio-Client-RuntimeRole to access the view through EMR Serverless Spark.

To create a resource link, complete the following steps:

  1. Open the Lake Formation console as the Lake Formation data lake administrator in the consumer account.
  2. In the navigation pane, choose Tables.
  3. Choose Create, then Resource link.
  4. For Resource link name, enter a name for the resource link (for example, customer_nonpii_view_rl).
  5. For Database, choose the glueviewblog_customer_<accountID>_db database.
  6. For Shared table's region, choose the Region of the shared table.
  7. For Shared table, choose customer_nonpii_view.
  8. Choose Create.
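The same resource link can be created with the Glue CreateTable API: a resource link is a TableInput whose TargetTable points at the shared view in the producer account. A sketch (both account IDs and the Region are placeholders):

```python
# Sketch: create the resource link in the consumer catalog, pointing at
# the view shared from the producer account.

resource_link_input = {
    "Name": "customer_nonpii_view_rl",
    "TargetTable": {
        "CatalogId": "111122223333",                     # producer account ID
        "DatabaseName": "glueviewblog_111122223333_db",  # producer database
        "Name": "customer_nonpii_view",
        "Region": "us-east-1",                           # Region of the shared view
    },
}

# import boto3
# glue = boto3.client("glue")
# glue.create_table(
#     DatabaseName="glueviewblog_customer_444455556666_db",  # consumer database
#     TableInput=resource_link_input,
# )
```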

Grant permissions on the database to the EMR job runtime role

Complete the following steps to grant permissions on the database glueviewblog_customer_<accountID>_db to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Databases.
  2. Select the database glueviewblog_customer_<accountID>_db and on the Actions menu, choose Grant.
  3. In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Client-RuntimeRole.
  4. In the Database permissions section, select Describe.
  5. Choose Grant.

Grant permissions on the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the resource link customer_nonpii_view_rl to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Select the resource link customer_nonpii_view_rl and on the Actions menu, choose Grant.
  3. In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Client-RuntimeRole.
  4. In the Resource link permissions section, select Describe for Resource link permissions.
  5. Choose Grant.

This allows the EMR Serverless job runtime role to describe the resource link. We don't make any selections for grantable permissions, because runtime roles shouldn't be able to grant permissions to other principals.

Grant permissions on the target of the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the target of the resource link customer_nonpii_view_rl to the EMR job runtime role:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Select the resource link customer_nonpii_view_rl and on the Actions menu, choose Grant on target.
  3. In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Client-RuntimeRole.
  4. In the View permissions section, select Select and Describe.
  5. Choose Grant.
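The two grants to the runtime role (Describe on the link, Select/Describe on its target) map to Lake Formation GrantPermissions calls. A sketch with placeholder account IDs; note that the resource link lives in the consumer catalog while its target view lives in the producer catalog:

```python
# Sketch: grant the consumer runtime role access to the resource link and
# to the shared view it points to.

role_arn = "arn:aws:iam::444455556666:role/GlueViewBlog-EMRStudio-Client-RuntimeRole"

link_grant = {
    "Principal": {"DataLakePrincipalIdentifier": role_arn},
    "Resource": {
        "Table": {
            "CatalogId": "444455556666",  # consumer account owns the link
            "DatabaseName": "glueviewblog_customer_444455556666_db",
            "Name": "customer_nonpii_view_rl",
        }
    },
    "Permissions": ["DESCRIBE"],
}
target_grant = {
    "Principal": {"DataLakePrincipalIdentifier": role_arn},
    "Resource": {
        "Table": {
            "CatalogId": "111122223333",  # producer account owns the view
            "DatabaseName": "glueviewblog_111122223333_db",
            "Name": "customer_nonpii_view",
        }
    },
    "Permissions": ["SELECT", "DESCRIBE"],
}

# import boto3
# lf = boto3.client("lakeformation")
# lf.grant_permissions(**link_grant)
# lf.grant_permissions(**target_grant)
```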

Set up an EMR Serverless application and Workspace in the consumer account

Repeat the steps to create an EMR Serverless application in the consumer account.

Repeat the steps to create a Workspace in the consumer account. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-Client-RuntimeRole as the runtime role.

Verify access using interactive notebooks from EMR Studio

Complete the following steps to verify access in EMR Studio:

  1. Download the notebook glueviewblog_emr_consumer.ipynb. The code runs a select statement on the view shared from the producer.
  2. In your EMR Workspace emrs-glueviewblog-workspace, navigate to the File browser section and choose Upload files.
  3. Upload glueviewblog_emr_consumer.ipynb.
  4. Update the data lake bucket name, AWS account ID, and Region to match your resources.
  5. Update the database to match your resources.
  6. Save the notebook.
  7. Choose the double arrow icon to restart the kernel with the PySpark kernel and rerun the notebook.
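The core of the consumer notebook is a select on the shared view through its resource link. A minimal sketch (database name is a placeholder; the notebook's actual query may select specific columns):

```python
# Sketch: query the shared view via its resource link from the consumer
# account's PySpark session.

def build_consumer_query(database: str) -> str:
    """Build the SELECT that reads the shared view through its resource link."""
    return f"SELECT * FROM {database}.customer_nonpii_view_rl LIMIT 10"

query = build_consumer_query("glueviewblog_customer_444455556666_db")
# In the EMR Studio PySpark notebook:
# spark.sql(query).show()
```

Because Lake Formation fine-grained access control is enabled on the application, the runtime role only sees the non-PII columns the view exposes.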

Verify access using interactive notebooks from AWS Glue Studio

Complete the following steps to verify access using AWS Glue Studio:

  1. Download the notebook glueviewblog_glue_consumer.ipynb.
  2. Open the AWS Glue Studio console.
  3. Choose Notebook and then choose Upload notebook.
  4. Upload the notebook glueviewblog_glue_consumer.ipynb.
  5. For IAM role, choose GlueViewBlog-EMRStudio-Client-RuntimeRole.
  6. Choose Create notebook.
  7. Update the data lake bucket name, AWS account ID, and Region to match your resources.
  8. Update the database to match your resources.
  9. Save the notebook.
  10. Run all the cells to verify fine-grained access.

Verify access using the Athena query editor

Because the view from the producer account was shared with the consumer account, the Lake Formation administrator has access to the view in the producer account. Also, because the lake admin role created the resource link pointing to the view, it also has access to the resource link. Go to the Athena query editor and run a simple select query on the resource link.

The analytics team in the consumer account was able to access a subset of the data from a business data producer team, using their analytics tools of choice.
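The same verification can be driven through the Athena API. A sketch of the request (database and workgroup names are placeholders; pick whichever workgroup your admin role can use):

```python
# Sketch: run the verification query on the resource link via the Athena
# StartQueryExecution API instead of the console query editor.

query_args = {
    "QueryString": "SELECT * FROM customer_nonpii_view_rl LIMIT 10",
    "QueryExecutionContext": {"Database": "glueviewblog_customer_444455556666_db"},
    "WorkGroup": "primary",  # placeholder workgroup
}

# import boto3
# athena = boto3.client("athena")
# execution_id = athena.start_query_execution(**query_args)["QueryExecutionId"]
```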

Clean up

To avoid incurring ongoing costs, clean up your resources:

  1. In your consumer account, delete the AWS Glue notebook, stop and delete the EMR application, and then delete the EMR Workspace.
  2. In your consumer account, delete the CloudFormation stack. This removes the resources launched by the stack.
  3. In your producer account, log in to the Lake Formation console and revoke the LF-Tag based permissions you granted to the consumer account.
  4. In your producer account, stop and delete the EMR application and then delete the EMR Workspace.
  5. In your producer account, delete the CloudFormation stack. This deletes the resources launched by the stack.
  6. Review and clean up any additional AWS Glue and Lake Formation resources and permissions.

Conclusion

In this post, we demonstrated a powerful, enterprise-grade solution for cross-account data sharing and analysis using AWS services. We walked you through the following key steps:

  • Create a Data Catalog view using Spark in EMR Serverless within one AWS account
  • Securely share this view with another account using LF-TBAC
  • Access the shared view in the recipient account using Spark in both EMR Serverless and AWS Glue ETL
  • Implement this solution with Iceberg tables (it's also compatible with other open table formats like Apache Hudi and Delta Lake)

The multi-dialect Data Catalog views approach presented in this post is particularly valuable for enterprises looking to modernize their data infrastructure while optimizing costs, improve cross-functional collaboration while strengthening data governance, and accelerate business insights while maintaining control over sensitive information.

Refer to the documentation on Data Catalog views for the individual analytics services, and try out the solution. Let us know your feedback and questions in the comments section.


About the Authors

Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. As part of the SageMaker Lakehouse team, she works with AWS customers and partners to architect lakehouse solutions, enhance product features, and establish best practices for data governance.

Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.

Dhananjay Badaya is a Software Developer at AWS, specializing in distributed data processing engines including Apache Spark and Apache Hadoop. As a member of the Amazon EMR team, he focuses on designing and implementing enterprise governance solutions for EMR Spark.
