
Measure performance of AWS Glue Data Quality for ETL pipelines


In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. AWS Glue Data Quality reduces the effort required to validate data from days to hours, and provides computing recommendations, statistics, and insights about the resources required to run data validation.

AWS Glue Data Quality is built on Deequ, an open source tool developed and used at Amazon to calculate data quality metrics and verify data quality constraints and changes in the data distribution so you can focus on describing how data should look instead of implementing algorithms.

In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset. As part of the results, we show how AWS Glue Data Quality provides information about the runtime of extract, transform, and load (ETL) jobs, the resources measured in terms of data processing units (DPUs), and how you can track the cost of running AWS Glue Data Quality for ETL pipelines by defining custom cost reporting in AWS Cost Explorer.

Solution overview

We start by defining our test dataset in order to explore how AWS Glue Data Quality automatically scales depending on the input dataset.

Dataset details

The test dataset contains 104 columns and 1 million rows stored in Parquet format. You can download the dataset or recreate it locally using the Python script provided in the repository. If you prefer to run the generator script, you need to install the Pandas and Mimesis packages in your Python environment:

pip install pandas mimesis

The dataset schema is a combination of numerical, categorical, and string variables in order to have enough attributes to use a combination of built-in AWS Glue Data Quality rule types. The schema replicates some of the most common attributes found in financial market data such as instrument ticker, traded volumes, and pricing forecasts.
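If you prefer a quick local approximation instead of the repository's script, the following sketch illustrates the approach. It assumes the Finance provider available in recent Mimesis releases, and the column names are hypothetical stand-ins for a small subset of the 104 real columns:

import random

import pandas as pd
from mimesis import Finance  # assumed: Finance provider (mimesis >= 5)

finance = Finance()
n_rows = 1_000_000  # the benchmark dataset has 1 million rows

df = pd.DataFrame({
    # Categorical/string attributes
    "ticker": [finance.stock_ticker() for _ in range(n_rows)],
    "exchange": [finance.stock_exchange() for _ in range(n_rows)],
    # Numerical attributes
    "traded_volume": [random.randint(0, 1_000_000) for _ in range(n_rows)],
    "price_forecast": [round(random.uniform(1.0, 500.0), 2) for _ in range(n_rows)],
})

df.to_parquet("test_dataset.parquet", index=False)  # requires pyarrow or fastparquet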

Data quality rulesets

We categorize some of the built-in AWS Glue Data Quality rule types to define the benchmark structure. The categories consider whether the rules perform column checks that don't require row-level inspection (simple rules), row-by-row analysis (medium rules), or data type checks that eventually compare row values against other data sources (complex rules). The following table summarizes these rules.

Simple Rules            | Medium Rules        | Complex Rules
ColumnCount             | DistinctValuesCount | ColumnValues
ColumnDataType          | IsComplete          | Completeness
ColumnExists            | Sum                 | ReferentialIntegrity
ColumnNamesMatchPattern | StandardDeviation   | ColumnCorrelation
RowCount                | Mean                | RowCountMatch
                        | ColumnLength        |

We define eight different AWS Glue ETL jobs where we run the data quality rulesets. Each job has a different number of data quality rules associated with it. Each job also has an associated user-defined cost allocation tag that we use to create a data quality cost report in AWS Cost Explorer later on.

We provide the plain text definition for each ruleset in the following table.

Job Name    | Simple Rules | Medium Rules | Complex Rules | Number of Rules | Tag         | Definition
ruleset-0   | 0            | 0            | 0             | 0               | dqjob:rs0   |
ruleset-1   | 0            | 0            | 1             | 1               | dqjob:rs1   | Link
ruleset-5   | 3            | 1            | 1             | 5               | dqjob:rs5   | Link
ruleset-10  | 6            | 2            | 2             | 10              | dqjob:rs10  | Link
ruleset-50  | 30           | 10           | 10            | 50              | dqjob:rs50  | Link
ruleset-100 | 50           | 30           | 20            | 100             | dqjob:rs100 | Link
ruleset-200 | 100          | 60           | 40            | 200             | dqjob:rs200 | Link
ruleset-400 | 200          | 120          | 80            | 400             | dqjob:rs400 | Link
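As a reference, a ruleset definition in Data Quality Definition Language (DQDL) with the same 3/1/1 split as ruleset-5 could look like the following. The column names are illustrative, not the actual benchmark definition:

Rules = [
    RowCount > 0,
    ColumnCount = 104,
    ColumnExists "ticker",
    Mean "traded_volume" > 0,
    ColumnValues "currency" in ["USD", "EUR", "GBP"]
]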

Create the AWS Glue ETL jobs containing the data quality rulesets

We upload the test dataset to Amazon Simple Storage Service (Amazon S3), along with two additional CSV files that we will use to evaluate referential integrity rules in AWS Glue Data Quality (isocodes.csv and exchanges.csv) once they have been added to the AWS Glue Data Catalog. Complete the following steps:

  1. On the Amazon S3 console, create a new S3 bucket in your account and upload the test dataset.
  2. Create a folder in the S3 bucket called isocodes and upload the isocodes.csv file.
  3. Create another folder in the S3 bucket called exchange and upload the exchanges.csv file.
  4. On the AWS Glue console, run two AWS Glue crawlers, one for each folder, to register the CSV content in the AWS Glue Data Catalog (data_quality_catalog). For instructions, refer to Adding an AWS Glue Crawler.

The AWS Glue crawlers generate two tables (exchanges and isocodes) in the AWS Glue Data Catalog.

AWS Glue Data Catalog
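If you prefer to script this step, the same two crawlers can be created and started with boto3. This is a minimal sketch; the crawler names are arbitrary, and it assumes an IAM role with AWS Glue crawler permissions and read access to the bucket, such as the AWSGlueDataQualityPerformanceRole defined in the next section:

import boto3

glue = boto3.client("glue")

# One crawler per folder; both register their tables in data_quality_catalog.
for crawler_name, s3_path in [
    ("isocodes-crawler", "s3://<your_Amazon_S3_bucket_name>/isocodes/"),
    ("exchanges-crawler", "s3://<your_Amazon_S3_bucket_name>/exchange/"),
]:
    glue.create_crawler(
        Name=crawler_name,
        Role="AWSGlueDataQualityPerformanceRole",
        DatabaseName="data_quality_catalog",
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=crawler_name)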

Now we'll create the AWS Identity and Access Management (IAM) role that will be assumed by the ETL jobs at runtime:

  1. On the IAM console, create a new IAM role called AWSGlueDataQualityPerformanceRole.
  2. For Trusted entity type, select AWS service.
  3. For Service or use case, choose Glue.
  4. Choose Next.

AWS IAM trust entity selection

  5. For Permissions policies, enter AWSGlueServiceRole.
  6. Choose Next.
    AWS IAM add permissions policies
  7. Create and attach a new inline policy (AWSGlueDataQualityBucketPolicy) with the following content. Replace the placeholder with the S3 bucket name you created earlier:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "s3:GetObject",
          "Resource": [
            "arn:aws:s3:::<your_Amazon_S3_bucket_name>/*"
          ]
        }
      ]
    }
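The same role can also be provisioned programmatically. A minimal boto3 sketch, reusing the placeholder bucket name from the policy above:

import json

import boto3

iam = boto3.client("iam")
ROLE_NAME = "AWSGlueDataQualityPerformanceRole"

# Trust policy so AWS Glue can assume the role at runtime.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName=ROLE_NAME, AssumeRolePolicyDocument=json.dumps(trust_policy))

# Managed policy with the baseline AWS Glue permissions.
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# Inline policy granting read access to the dataset bucket.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": ["arn:aws:s3:::<your_Amazon_S3_bucket_name>/*"],
    }],
}
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="AWSGlueDataQualityBucketPolicy",
    PolicyDocument=json.dumps(bucket_policy),
)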

Next, we create one of the AWS Glue ETL jobs, ruleset-5.

  1. On the AWS Glue console, under ETL jobs in the navigation pane, choose Visual ETL.
  2. In the Create job section, choose Visual ETL.
    Overview of available jobs in AWS Glue Studio
  3. In the Visual Editor, add a Data Source – S3 Bucket source node:
    1. For S3 URL, enter the S3 folder containing the test dataset.
    2. For Data format, choose Parquet.

    Overview of Amazon S3 data source in AWS Glue Studio

  4. Create a new action node, Transform: Evaluate Data Quality.
  5. For Node parents, choose the node you created.
  6. Add the ruleset-5 definition under Ruleset editor.
    Data quality rules for ruleset-5
  7. Scroll to the end and, under Performance Configuration, enable Cache Data.

Enable Cache data option

  8. Under Job details, for IAM Role, choose AWSGlueDataQualityPerformanceRole.
    Select previously created AWS IAM role
  9. In the Tags section, define the dqjob tag as rs5.

This tag will be different for each of the data quality ETL jobs; we use them in AWS Cost Explorer to review the ETL job costs.

Define dqjob tag for ruleset-5 job

  10. Choose Save.
  11. Repeat these steps with the rest of the rulesets to define all the ETL jobs.

Overview of jobs defined in AWS Glue Studio
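Behind the visual editor, AWS Glue Studio generates a PySpark script for each job. The following trimmed sketch shows roughly what the ruleset-5 job body looks like; the node names, S3 path, and ruleset columns are illustrative, not the exact generated output:

import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Data Source - S3 Bucket node: the Parquet test dataset.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<your_Amazon_S3_bucket_name>/dataset/"]},
    format="parquet",
)

# ruleset-5 definition (same shape as the DQDL example shown earlier).
ruleset = """
Rules = [
    RowCount > 0,
    ColumnCount = 104,
    ColumnExists "ticker",
    Mean "traded_volume" > 0,
    ColumnValues "currency" in ["USD", "EUR", "GBP"]
]
"""

# Transform: Evaluate Data Quality node.
EvaluateDataQuality().process_rows(
    frame=source,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "EvaluateDataQuality_node",
        "enableDataQualityResultsPublishing": True,
    },
)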

Run the AWS Glue ETL jobs

Complete the following steps to run the ETL jobs:

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. Select the ETL job and choose Run job.
  3. Repeat for all the ETL jobs.

Select one AWS Glue job and choose Run Job on the top right
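Alternatively, all eight jobs can be started with boto3; a small sketch assuming the job names from the ruleset table:

import boto3

glue = boto3.client("glue")

for n in (0, 1, 5, 10, 50, 100, 200, 400):
    run = glue.start_job_run(JobName=f"ruleset-{n}")
    print(f"ruleset-{n}: started run {run['JobRunId']}")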

When the ETL jobs are complete, the Job run monitoring page displays the job details. As shown in the following screenshot, a DPU hours column is provided for each ETL job.

Overview of AWS Glue jobs monitoring

Review performance

The following table summarizes the duration, DPU hours, and estimated costs from running the eight different data quality rulesets over the same test dataset. Note that all rulesets were run with the entire test dataset described earlier (104 columns, 1 million rows).

ETL Job Name | Number of Rules | Tag         | Duration (sec) | # of DPU hours | # of DPUs | Cost ($)
ruleset-400  | 400             | dqjob:rs400 | 445.7          | 1.24           | 10        | $0.54
ruleset-200  | 200             | dqjob:rs200 | 235.7          | 0.65           | 10        | $0.29
ruleset-100  | 100             | dqjob:rs100 | 186.5          | 0.52           | 10        | $0.23
ruleset-50   | 50              | dqjob:rs50  | 155.2          | 0.43           | 10        | $0.19
ruleset-10   | 10              | dqjob:rs10  | 152.2          | 0.42           | 10        | $0.18
ruleset-5    | 5               | dqjob:rs5   | 150.3          | 0.42           | 10        | $0.18
ruleset-1    | 1               | dqjob:rs1   | 150.1          | 0.42           | 10        | $0.18
ruleset-0    | 0               | dqjob:rs0   | 53.2           | 0.15           | 10        | $0.06

The cost of evaluating an empty ruleset is close to zero, but it has been included because it can be used as a quick test to validate the IAM roles associated with the AWS Glue Data Quality jobs and the read permissions to the test dataset in Amazon S3. The cost of data quality jobs only starts to increase after evaluating rulesets with more than 100 rules, remaining constant below that number.

We can observe that the cost of running data quality for the largest ruleset in the benchmark (400 rules) is still only slightly above $0.50.
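The Cost column follows directly from the DPU-hours, assuming the standard AWS Glue rate of $0.44 per DPU-hour (the exact rate varies by Region):

DPU_HOUR_RATE = 0.44  # assumed us-east-1 rate; check your Region's pricing

for job, dpu_hours in [("ruleset-200", 0.65), ("ruleset-100", 0.52)]:
    print(f"{job}: ${dpu_hours * DPU_HOUR_RATE:.2f}")
# ruleset-200: $0.29
# ruleset-100: $0.23
# Other rows can differ by a cent because the DPU-hours shown are rounded.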

Data quality cost analysis in AWS Cost Explorer

To see the data quality ETL job tags in AWS Cost Explorer, you need to activate the user-defined cost allocation tags first.

After you create and apply user-defined tags to your resources, it can take up to 24 hours for the tag keys to appear on your cost allocation tags page for activation, and up to another 24 hours for the tag keys to activate after that.

  1. On the AWS Cost Explorer console, choose Cost Explorer Saved Reports in the navigation pane.
  2. Choose Create new report.
    Create new AWS Cost Explorer report
  3. Select Cost and usage as the report type.
  4. Choose Create Report.
    Confirm creation of a new AWS Cost Explorer report
  5. For Date Range, enter a date range.
  6. For Granularity, choose Daily.
  7. For Dimension, choose Tag, then choose the dqjob tag.
    Report parameter selection in AWS Cost Explorer
  8. Under Applied filters, choose the dqjob tag and the eight tags used in the data quality rulesets (rs0, rs1, rs5, rs10, rs50, rs100, rs200, and rs400).
    Select the eight tags used to tag the data quality AWS Glue jobs
  9. Choose Apply.

The Cost and usage report will be updated, with the X-axis showing the data quality ruleset tags as categories. The Cost and usage graph in AWS Cost Explorer refreshes to show the total monthly cost of the most recently run data quality ETL jobs, aggregated by ETL job.

The AWS Cost Explorer report shows the costs associated with running the data quality AWS Glue Studio jobs
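Once the dqjob tag is active, the same numbers can also be retrieved programmatically through the Cost Explorer API. A sketch with a hypothetical date range:

import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-06-01", "End": "2024-06-30"},  # hypothetical range
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "dqjob"}],
    Filter={"Tags": {"Key": "dqjob",
                     "Values": ["rs0", "rs1", "rs5", "rs10",
                                "rs50", "rs100", "rs200", "rs400"]}},
)

for period in response["ResultsByTime"]:
    for group in period["Groups"]:
        print(period["TimePeriod"]["Start"],
              group["Keys"],  # e.g. ['dqjob$rs400']
              group["Metrics"]["UnblendedCost"]["Amount"])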

Clean up

To clean up the infrastructure and avoid additional costs, complete the following steps:

  1. Empty the S3 bucket initially created to store the test dataset.
  2. Delete the ETL jobs you created in AWS Glue.
  3. Delete the AWSGlueDataQualityPerformanceRole IAM role.
  4. Delete the custom report created in AWS Cost Explorer.
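Most of the cleanup can also be scripted; a sketch that assumes the resource names used in this post (the saved Cost Explorer report still needs to be deleted from the console):

import boto3

# Empty the dataset bucket (replace the placeholder with your bucket name).
boto3.resource("s3").Bucket("<your_Amazon_S3_bucket_name>").objects.all().delete()

# Delete the eight ETL jobs.
glue = boto3.client("glue")
for n in (0, 1, 5, 10, 50, 100, 200, 400):
    glue.delete_job(JobName=f"ruleset-{n}")

# Remove the inline and managed policies before deleting the IAM role.
iam = boto3.client("iam")
iam.delete_role_policy(
    RoleName="AWSGlueDataQualityPerformanceRole",
    PolicyName="AWSGlueDataQualityBucketPolicy",
)
iam.detach_role_policy(
    RoleName="AWSGlueDataQualityPerformanceRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
iam.delete_role(RoleName="AWSGlueDataQualityPerformanceRole")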

Conclusion

AWS Glue Data Quality provides an efficient way to incorporate data quality validation as part of ETL pipelines and scales automatically to accommodate increasing volumes of data. The built-in data quality rule types offer a wide range of options to customize the data quality checks and focus on how your data should look instead of implementing undifferentiated logic.

In this benchmark analysis, we showed how common-size AWS Glue Data Quality rulesets have little or no overhead, whereas in complex cases, the cost increases linearly. We also reviewed how you can tag AWS Glue Data Quality jobs to make cost information available in AWS Cost Explorer for quick reporting.

AWS Glue Data Quality is generally available in all AWS Regions where AWS Glue is available. Learn more about AWS Glue Data Quality and the AWS Glue Data Catalog in Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog.


About the Authors


Ruben Afonso Francos
Ruben Afonso is a Global Financial Services Solutions Architect with AWS. He enjoys working on analytics and AI/ML challenges, with a passion for automation and optimization. When not at work, he enjoys finding hidden spots off the beaten path around Barcelona.


Kalyan Kumar Neelampudi (KK)
Kalyan Kumar Neelampudi (KK) is a Specialist Partner Solutions Architect (Data Analytics & Generative AI) at AWS. He acts as a technical advisor and collaborates with various AWS partners to design, implement, and build practices around data analytics and AI/ML workloads. Outside of work, he's a badminton enthusiast and culinary adventurer, exploring local cuisines and traveling with his partner to discover new tastes and experiences.

Gonzalo Herreros
Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.
