The Amazon EMR runtime for Apache Spark is a performance-optimized runtime that's 100% API compatible with open source Apache Spark. It offers faster out-of-the-box performance than Apache Spark through improved query plans, faster queries, and tuned defaults. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, and Amazon EMR on AWS Outposts all use this optimized runtime, which is 4.5 times faster than Apache Spark 3.5.1 and has 2.8 times better price-performance based on an industry-standard benchmark derived from TPC-DS at 3 TB scale (note that our TPC-DS derived benchmark results are not directly comparable with official TPC-DS benchmark results).
We have added 35 optimizations since the EOY 2022 release, EMR 6.9, which are included in both EMR 7.0 and EMR 7.1. These improvements are turned on by default and are 100% API compatible with Apache Spark. Some of the improvements since our previous post, Amazon EMR on EKS widens the performance gap, include:
- Spark physical plan operator improvements – We continue to improve Spark runtime performance by changing the operator algorithms:
- Optimized the data structures used in hash joins for performance and memory requirements, allowing the use of a more performant join algorithm in more cases
- Optimized sorting for partial windows
- Optimized rollup operations
- Improved sort algorithm for shuffle partitioning
- Optimized hash aggregate operator
- More efficient decimal arithmetic operations
- Aggregates based on Parquet statistics
- Spark query planning improvements – We introduced new rules in Spark's Catalyst optimizer to improve efficiency:
- Adaptively eliminate redundant joins
- Adaptively identify and disable unhelpful optimizations at runtime
- Infer more advanced Bloom filters and dynamic partition pruning filters from complex query plans to reduce the amount of data shuffled and read from Amazon Simple Storage Service (Amazon S3)
- Fewer requests to Amazon S3 – We reduced the number of requests sent to Amazon S3 when reading Parquet files by minimizing unnecessary requests and introducing a cache for Parquet footers.
- Java 17 as the default Java runtime in Amazon EMR 7.0 – Java 17 was extensively tested and tuned for optimal performance, allowing us to make it the default Java runtime for Amazon EMR 7.0.
For more details on EMR Spark performance optimizations, refer to Optimize Spark performance.
In this post, we share our testing methodology and benchmark results comparing the latest Amazon EMR versions (7.0 and 7.1) with the EOY 2022 release (version 6.9) and Apache Spark 3.5.1 to demonstrate the latest cost improvements Amazon EMR has achieved.
Benchmark results for Amazon EMR 7.1 vs. Apache Spark 3.5.1
To evaluate the Spark engine performance, we ran benchmark tests with the 3 TB TPC-DS dataset. We used EMR Spark clusters for benchmark tests on Amazon EMR and installed Apache Spark 3.5.1 on Amazon Elastic Compute Cloud (Amazon EC2) clusters designated for open source Spark (OSS) benchmark runs. We ran tests on separate EC2 clusters comprised of nine r5d.4xlarge instances for each of Apache Spark 3.5.1, Amazon EMR 6.9.0, and Amazon EMR 7.1. The primary node has 16 vCPU and 128 GB memory, and the eight worker nodes have a total of 128 vCPU and 1024 GB memory. We tested with Amazon EMR defaults to showcase the out-of-the-box experience and tuned Apache Spark with the minimal settings needed to provide a fair comparison.
For the source data, we chose the 3 TB scale factor, which contains 17.7 billion records, approximately 924 GB of compressed data in Parquet file format. The setup instructions and technical details can be found in the GitHub repository. We used Spark's in-memory data catalog to store metadata for TPC-DS databases and tables; spark.sql.catalogImplementation is set to the default value in-memory. The fact tables are partitioned by the date column, which consists of partitions ranging from 200–2,100. No statistics were pre-calculated for these tables.
A total of 104 SparkSQL queries were run sequentially in three iterations, and the average of each query's runtime across these three iterations was used for comparison. The average runtime across the three iterations on Amazon EMR 7.1 was 0.51 hours, which is 1.9 times faster than Amazon EMR 6.9 and 4.5 times faster than Apache Spark 3.5.1. The following figure illustrates the total runtimes in seconds.

The per-query speedup on Amazon EMR 7.1 compared to Apache Spark 3.5.1 is illustrated in the following chart. Although Amazon EMR is faster than Apache Spark on all TPC-DS queries, the speedup is much greater on some queries than on others. The horizontal axis represents queries in the TPC-DS 3 TB benchmark ordered by the Amazon EMR speedup in descending order, and the vertical axis shows the speedup of queries due to the Amazon EMR runtime.

Cost comparison
Our benchmark outputs the total runtime and geometric mean figures to measure the Spark runtime performance by simulating a real-world complex decision support use case. The cost metric can provide us with additional insights. Cost estimates are computed using the following formulas. They take into account Amazon EC2, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR costs, but don't include Amazon S3 GET and PUT costs.
- Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
- r5d.4xlarge hourly rate = $1.152 per hour
- Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
- Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR rate * job runtime in hours
- r5d.4xlarge Amazon EMR rate = $0.27 per hour
- Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost
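As a sanity check on these formulas, the arithmetic can be sketched in a few lines of Python. The EC2 and EMR rates are the ones quoted above; the EBS rate is an assumption (gp2 at roughly $0.10 per GB-month, amortized over 730 hours per month), since the post doesn't state it:

```python
# Cost model from the formulas above, applied to the Amazon EMR 7.1 run.
EC2_HOURLY = 1.152          # r5d.4xlarge On-Demand rate, $/hour (from the post)
EMR_HOURLY = 0.27           # r5d.4xlarge Amazon EMR rate, $/hour (from the post)
EBS_GB_HOURLY = 0.10 / 730  # assumed gp2 rate: $0.10/GB-month over 730 hours

def benchmark_cost(instances: int, runtime_hours: float, ebs_gb: int) -> dict:
    """Return the per-service and total cost for one benchmark run."""
    ec2 = instances * EC2_HOURLY * runtime_hours
    ebs = instances * EBS_GB_HOURLY * ebs_gb * runtime_hours
    emr = instances * EMR_HOURLY * runtime_hours
    return {"ec2": ec2, "ebs": ebs, "emr": emr, "total": ec2 + ebs + emr}

cost = benchmark_cost(instances=9, runtime_hours=0.51, ebs_gb=20)
print({k: round(v, 2) for k, v in cost.items()})
# -> {'ec2': 5.29, 'ebs': 0.01, 'emr': 1.24, 'total': 6.54}
```

Rounded to cents, this reproduces the Amazon EMR 7.1 column of the cost table.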
Based on this calculation, the Amazon EMR 7.1 benchmark result demonstrates a 2.8 times improvement in job cost compared to Apache Spark 3.5.1 and a 1.7 times improvement compared to Amazon EMR 6.9.
| Metric | Amazon EMR 7.1 | Amazon EMR 6.9 | Apache Spark 3.5.1 |
| --- | --- | --- | --- |
| Runtime in hours | 0.51 | 0.87 | 1.76 |
| Number of EC2 instances | 9 | 9 | 9 |
| Amazon EBS size | 20 GB | 20 GB | 20 GB |
| Amazon EC2 cost | $5.29 | $9.02 | $18.25 |
| Amazon EBS cost | $0.01 | $0.02 | $0.04 |
| Amazon EMR cost | $1.24 | $2.11 | $0.00 |
| Total cost | $6.54 | $11.15 | $18.29 |
| Cost savings | Baseline | Amazon EMR 7.1 is 1.7 times better | Amazon EMR 7.1 is 2.8 times better |
Run OSS Spark benchmarking
To run Apache Spark 3.5.1, we used the following configurations to set up an EC2 cluster: one primary node and eight worker nodes of type r5d.4xlarge.
| EC2 Instance | vCPU | Memory (GiB) | Instance Storage (GB) | EBS Root Volume (GB) |
| --- | --- | --- | --- | --- |
| r5d.4xlarge | 16 | 128 | 2 x 300 NVMe SSD | 20 GB |
Prerequisites
The following prerequisites are required to run the benchmark:
- Using the instructions in the emr-spark-benchmark GitHub repo, set up the TPC-DS source data in your S3 bucket and on your local computer.
- Build the benchmark application following the steps provided in Steps to build spark-benchmark-assembly application and copy the benchmark application to your S3 bucket. Alternatively, copy spark-benchmark-assembly-3.5.1.jar to your S3 bucket.
This benchmark application is built from the branch tpcds-v2.13. If you're building a new benchmark application, switch to the correct branch after downloading the source code from the GitHub repo.
Create and configure a YARN cluster on Amazon EC2
Follow the instructions in the emr-spark-benchmark GitHub repo to create an OSS Spark cluster on Amazon EC2 using Flintrock.
Based on the cluster selection for this test, the following are the configurations used:
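The cluster configuration itself lives in the GitHub repo. For illustration only, a minimal Flintrock config.yaml for a cluster of this shape might look like the following sketch; every value here is a placeholder or assumption, so defer to the repo's instructions:

```yaml
services:
  spark:
    version: 3.5.1                  # OSS Spark version under test
provider: ec2
providers:
  ec2:
    key-name: my-keypair            # placeholder
    identity-file: /path/to/key.pem # placeholder
    instance-type: r5d.4xlarge      # matches the benchmark cluster
    region: us-east-1               # placeholder
    ami: ami-xxxxxxxxxxxx           # placeholder Amazon Linux AMI
launch:
  num-slaves: 8                     # eight worker nodes
```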
Run the TPC-DS benchmark for Apache Spark 3.5.1
Complete the following steps to run the TPC-DS benchmark for Apache Spark 3.5.1:
- Log in to the OSS cluster primary node using flintrock login $CLUSTER_NAME.
- Submit your Spark job:
  - The TPC-DS source data is at s3a://<YOUR_S3_BUCKET>/BLOG_TPCDS-TEST-3T-partitioned. Check the prerequisites for how to set up the source data.
  - The results are created in s3a://<YOUR_S3_BUCKET>/benchmark_run.
  - You can track progress in /media/ephemeral0/spark_run.log.
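The exact spark-submit invocation is in the emr-spark-benchmark repo and isn't reproduced here. As a sketch only, with an assumed entry-point class and the trailing benchmark arguments omitted, the submission has roughly this shape:

```shell
# Sketch: the --class value and the full argument list are assumptions;
# use the command from the emr-spark-benchmark repo verbatim.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.amazonaws.eks.tpcds.BenchmarkSQL \
  spark-benchmark-assembly-3.5.1.jar \
  s3a://<YOUR_S3_BUCKET>/BLOG_TPCDS-TEST-3T-partitioned \
  s3a://<YOUR_S3_BUCKET>/benchmark_run
```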
Summarize the results
When the Spark job is complete, download the test result file from the output S3 bucket s3a://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. You can use the Amazon S3 console to navigate to the output bucket location, or use the AWS Command Line Interface (AWS CLI).
The Spark benchmark application creates a timestamp folder and writes a summary file inside a summary.csv prefix. Your timestamp and file name will be different from the ones shown in the preceding example.
The output CSV files have four columns without header names:
- Query name
- Median time
- Minimum time
- Maximum time
Because we have three runs, we can then compute the average and geometric mean of the runtimes.
Run the TPC-DS benchmark using Amazon EMR Spark
For detailed instructions, see Steps to run Spark Benchmarking.
Prerequisites
Complete the following prerequisite steps:
- Run aws configure to configure your AWS CLI shell to point to the benchmarking account. Refer to Configure the AWS CLI for instructions.
- Upload the benchmark application to Amazon S3.
Deploy the EMR cluster and run the benchmark job
Complete the following steps to run the benchmark job:
- Use the AWS CLI command as shown in Deploy EMR Cluster and run benchmark job to spin up an EMR on EC2 cluster. Update the provided script with the correct Amazon EMR version and root volume size, and supply the values required. Refer to create-cluster for a detailed description of the AWS CLI options.
- Store the cluster ID from the response. You need it in the next step.
- Submit the benchmark job in Amazon EMR using add-steps in the AWS CLI:
  - Replace <cluster ID> with the cluster ID from the create cluster response.
  - The benchmark application is at s3://<YOUR_S3_BUCKET>/spark-benchmark-assembly-3.5.1.jar.
  - The TPC-DS source data is at s3://<YOUR_S3_BUCKET>/BLOG_TPCDS-TEST-3T-partitioned.
  - The results are created in s3://<YOUR_S3_BUCKET>/benchmark_run.
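For illustration, an add-steps call of that shape might look like the following sketch; the Args list is abbreviated and the entry-point class is an assumption, so take the full command from the repo:

```shell
# Sketch only: replace <cluster ID> and <YOUR_S3_BUCKET>, and use the full
# Args list from the emr-spark-benchmark repo.
aws emr add-steps \
  --cluster-id <cluster ID> \
  --steps 'Type=Spark,Name=TPCDS-Benchmark,ActionOnFailure=CONTINUE,Args=[--class,com.amazonaws.eks.tpcds.BenchmarkSQL,s3://<YOUR_S3_BUCKET>/spark-benchmark-assembly-3.5.1.jar]'
```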
Summarize the results
After the job is complete, retrieve the summary results from s3://<YOUR_S3_BUCKET>/benchmark_run in the same way as for the OSS benchmark runs, and compute the average and geometric mean for the Amazon EMR runs.
Clean up
To avoid incurring future charges, delete the resources you created by following the instructions in the Cleanup section of the GitHub repo.
Summary
Amazon EMR continues to improve the EMR runtime for Apache Spark, delivering a 1.9x year-over-year performance improvement and 4.5x faster performance than OSS Spark 3.5.1. We recommend staying up to date with the latest Amazon EMR release to take advantage of the latest performance benefits.
To keep up to date, subscribe to the Big Data Blog's RSS feed to learn more about the EMR runtime for Apache Spark, configuration best practices, and tuning advice.
About the authors
Ashok Chintalapati is a software development engineer for Amazon EMR at Amazon Web Services.
Steve Koonce is an Engineering Manager for EMR at Amazon Web Services.
