
Amazon EMR Serverless eliminates local storage provisioning, reducing data processing costs by up to 20%


At AWS re:Invent 2025, Amazon Web Services (AWS) announced serverless storage for Amazon EMR Serverless, a new capability that eliminates the need to configure local disks for Apache Spark workloads. This reduces data processing costs by up to 20% while eliminating job failures from disk capacity constraints.

With serverless storage, Amazon EMR Serverless automatically handles intermediate data operations, such as shuffle, on your behalf. You pay only for compute and memory, with no storage charges. By decoupling storage from compute, Spark can release idle workers immediately, reducing costs throughout the job lifecycle. The following image shows the serverless storage for EMR Serverless announcement from the AWS re:Invent 2025 keynote:

The challenge: Sizing local disk storage

Running Apache Spark workloads requires sizing local disk storage for shuffle operations, where Spark redistributes data across executors during joins, aggregations, and sorts. This requires analyzing job histories to estimate disk requirements, leading to two common problems: overprovisioning wastes money on unused capacity, and underprovisioning causes job failures when disk space runs out. Most customers overprovision local storage to ensure jobs complete successfully in production.

Data skew compounds this further. When one executor handles a disproportionately large partition, that executor takes significantly longer to complete while other workers sit idle. If you didn't provision enough disk for that skewed executor, the job fails entirely, making data skew one of the top causes of Spark job failures. However, the problem extends beyond capacity planning. Because shuffle data couples tightly to local disks, Spark executors pin to worker nodes even when compute requirements drop between job stages. This prevents Spark from releasing workers and scaling down, inflating compute costs throughout the job lifecycle. When a worker node fails, Spark must recompute the shuffle data stored on that node, causing delays and inefficient resource utilization.
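To make the overprovisioning pressure concrete, here is a back-of-the-envelope sketch (illustrative numbers only, not EMR internals): with local disks, every worker must be sized for the largest shuffle partition, because any executor may be assigned the hot one.

```python
# Illustrative sketch: why data skew forces local-disk overprovisioning.
# Hypothetical partition sizes (GB) for one shuffle stage; one hot key
# inflates a single partition.
partition_gb = [2, 3, 2, 4, 3, 48, 2, 3]

mean_gb = sum(partition_gb) / len(partition_gb)
max_gb = max(partition_gb)
skew_ratio = max_gb / mean_gb

# With local disks, EVERY worker is sized for the worst-case partition.
provisioned_gb = max_gb * len(partition_gb)
actual_gb = sum(partition_gb)
waste_pct = 100 * (1 - actual_gb / provisioned_gb)

print(f"skew ratio: {skew_ratio:.1f}x")
print(f"provisioned: {provisioned_gb} GB, used: {actual_gb} GB, wasted: {waste_pct:.0f}%")
```

With this toy distribution, one 48 GB partition drives roughly a 5.7x skew ratio and leaves over 80% of the provisioned capacity unused, which is exactly the trade-off the elastic storage layer removes.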

How it works

Serverless storage for Amazon EMR Serverless addresses these challenges by offloading shuffle operations from individual compute workers onto a separate, elastic storage layer. Instead of storing shuffle data on local disks attached to Spark executors, serverless storage automatically provisions and scales high-performance remote storage as your job runs.

The architecture provides several key benefits. First, compute and storage scale independently: Spark can acquire and release workers as needed across job stages without worrying about preserving locally stored data. Second, shuffle data is evenly distributed across the serverless storage layer, eliminating the data skew bottlenecks that occur when some executors handle disproportionately large shuffle partitions. Third, if a worker node fails, your job continues processing without delays or reruns because data is reliably stored outside individual compute workers.

Serverless storage is available at no additional charge, and it eliminates the cost associated with local storage. Instead of paying for fixed disk capacity sized for the maximum potential I/O load, capacity that often sits idle during lighter workloads, you can use serverless storage without incurring storage costs. You can focus your budget on compute resources that directly process your data, not on managing and overprovisioning disk storage.

Technical innovation brings three breakthroughs

Serverless storage introduces three fundamental innovations that solve Spark's shuffle bottlenecks: multi-tier aggregation architecture, purpose-built networking, and true storage-compute decoupling. Apache Spark's shuffle mechanism has a core constraint: each mapper independently writes output as small files, and each reducer must fetch data from potentially thousands of workers. In a large-scale job with 10,000 mappers and 1,000 reducers, this creates 10 million individual data exchanges. Serverless storage aggregates early and intelligently: mappers stream data to an aggregation layer that consolidates shuffle data in memory before committing to storage. While individual shuffle write and fetch operations might show slightly higher latency due to network round trips compared to local disk I/O, overall job performance improves by transforming millions of tiny I/O operations into a smaller number of large, sequential operations.
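The fan-out arithmetic is easy to verify. In a direct shuffle, exchanges grow as mappers × reducers; with an aggregation tier, each mapper writes over one connection and each reducer reads consolidated output, so exchanges grow roughly as mappers + reducers (the additive model is an illustrative simplification, not a statement about EMR internals):

```python
# Illustrative sketch of shuffle fan-out under a simple additive model.
mappers, reducers = 10_000, 1_000

# Direct shuffle: every reducer fetches a block from every mapper.
direct_exchanges = mappers * reducers

# Aggregated shuffle: each mapper streams to the aggregation layer once,
# and each reducer reads already-consolidated partitions.
aggregated_exchanges = mappers + reducers

print(f"direct: {direct_exchanges:,}, aggregated: {aggregated_exchanges:,}")
print(f"reduction: {direct_exchanges // aggregated_exchanges}x")
```

Under these assumptions, the 10 million exchanges from the example collapse to about 11,000, a roughly 900x reduction in individual data transfers.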

Traditional Spark shuffle creates a mesh network where each worker maintains connections to potentially hundreds of other workers, spending significant CPU on connection management rather than data processing. We built a custom networking stack where each mapper opens a single persistent remote procedure call (RPC) connection to our aggregator layer, eliminating the mesh complexity. Workers no longer run a shuffle service; they focus solely on processing your data.

Traditional Amazon EMR Serverless jobs store shuffle data on local disks, coupling the data lifecycle to the worker lifecycle: idle workers can't terminate without losing shuffle data. Serverless storage decouples these entirely by storing shuffle data in AWS managed storage, with opaque handles tracked by the driver. Workers can terminate immediately after completing tasks without data loss, enabling elastic scaling. In funnel-shaped queries, where early stages require massive parallelism that narrows as data aggregates, we're seeing up to 80% compute cost reduction in benchmarks by releasing idle workers instantly. The following diagram illustrates instant worker release in funnel-shaped queries.
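A rough worker-hour model shows where these savings come from (the stage shapes below are hypothetical, not AWS benchmark data): with coupled storage the peak fleet stays pinned for the whole job, while decoupled storage pays only for the workers each stage actually needs.

```python
# Illustrative worker-hour model for a funnel-shaped query.
# Each tuple is (stage duration in hours, workers needed); hypothetical values.
stages = [(1.0, 100), (1.0, 20), (1.0, 5)]

total_hours = sum(duration for duration, _ in stages)
peak_workers = max(workers for _, workers in stages)

# Coupled storage: shuffle data pins workers, so the peak fleet runs end to end.
coupled_worker_hours = peak_workers * total_hours

# Decoupled storage: idle workers are released at each stage boundary.
decoupled_worker_hours = sum(duration * workers for duration, workers in stages)

savings = 1 - decoupled_worker_hours / coupled_worker_hours
print(f"compute savings: {savings:.0%}")
```

Even this mild three-stage funnel cuts worker-hours by more than half; the steeper the funnel, the closer the savings get to the benchmark figures cited above.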

Our aggregator layer integrates directly with AWS Identity and Access Management (IAM), AWS Lake Formation, and fine-grained access control systems, providing job-level data isolation with access controls that match source data permissions.

Getting started

Serverless storage is available in multiple AWS Regions. For the current list of supported Regions, refer to the Amazon EMR User Guide.

New applications

Serverless storage can be enabled for new applications starting with Amazon EMR release 7.12. Follow these steps:

  1. Create an Amazon EMR Serverless application with Amazon EMR 7.12 or later:
aws emr-serverless create-application \
  --type "SPARK" \
  --name my-application \
  --release-label emr-7.12.0 \
  --runtime-configuration '[{
      "classification": "spark-defaults",
      "properties": {
        "spark.aws.serverlessStorage.enabled": "true"
      }
    }]' \
  --region us-east-1

  2. Submit your Spark job:
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <execution-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket>/<your_script.py>",
      "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=10"
    }
  }'

Existing applications

You can enable serverless storage for existing applications on Amazon EMR 7.12 or later by updating your application settings.

To enable serverless storage using the AWS Command Line Interface (AWS CLI), enter the following command:

aws emr-serverless update-application \
  --application-id <application-id> \
  --runtime-configuration '[{
      "classification": "spark-defaults",
      "properties": {
        "spark.aws.serverlessStorage.enabled": "true"
      }
    }]'

To enable serverless storage using the Amazon EMR Studio UI, navigate to your application in Amazon EMR Studio, go to Configuration, and add the Spark property spark.aws.serverlessStorage.enabled=true in the spark-defaults classification.

Job-level configuration

You can also enable serverless storage for specific jobs, even if it's not enabled at the application level:

aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <execution-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket>/<your_script.py>",
      "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=20g --conf spark.aws.serverlessStorage.enabled=true"
    }
  }'

(Optional) Disabling serverless storage

If you want to continue using local disks, you can disable serverless storage by omitting the spark.aws.serverlessStorage.enabled configuration or setting it to false at either the application or job level:

spark.aws.serverlessStorage.enabled=false

To use traditional local disk provisioning, configure the appropriate disk type and size for your application workers.
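For reference, local disk size is set per worker in the application's capacity configuration; the shape below sketches the workerConfiguration object used there (the specific values are placeholders, not recommendations, so check the Amazon EMR Serverless documentation for the valid ranges):

```json
{
  "workerConfiguration": {
    "cpu": "4vCPU",
    "memory": "16GB",
    "disk": "200GB"
  }
}
```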

Monitoring and cost tracking

You can monitor elastic shuffle usage through standard Spark UI metrics and track costs at the application level in AWS Cost Explorer and AWS Cost and Usage Reports. The service automatically handles performance optimization and scaling, so you don't need to tune configuration parameters.

When to use serverless storage

Serverless storage delivers the most value for workloads with substantial shuffle operations, typically jobs that shuffle more than 10 GB of data (and less than 200 GB per job, the limit as of this writing). These include:

  • Large-scale data processing with heavy aggregations and joins
  • Sort-heavy analytics workloads
  • Iterative algorithms that repeatedly access the same datasets

Jobs with unpredictable shuffle sizes benefit particularly well because serverless storage automatically scales capacity up and down based on real-time demand. For workloads with minimal shuffle activity or very short duration (under 2–3 minutes), the benefits may be limited. In these cases, the overhead of remote storage access might outweigh the advantages of elastic scaling.
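As a quick pre-flight check, the guidance above can be folded into a small helper. The thresholds mirror the numbers in this post; the function itself is illustrative, not an AWS API:

```python
# Illustrative helper applying this post's guidance; not an AWS API.
def recommend_serverless_storage(shuffle_gb: float, duration_min: float) -> bool:
    """Return True when serverless storage is likely to pay off for a job."""
    if shuffle_gb > 200:      # above the per-job limit as of this writing
        return False
    if duration_min < 3:      # very short jobs: remote-storage overhead dominates
        return False
    return shuffle_gb > 10    # substantial shuffle benefits the most

print(recommend_serverless_storage(50, 30))    # heavy shuffle, long job -> True
print(recommend_serverless_storage(2, 30))     # minimal shuffle -> False
print(recommend_serverless_storage(250, 60))   # above the current limit -> False
```

Jobs that fall outside these bounds can still run on traditional local disks, as described in the disabling section above.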

Security and data lifecycle

Your data is stored in serverless storage only while your job is running and is automatically deleted when your job completes. Because Amazon EMR Serverless batch jobs can run for up to 24 hours, your data will be stored for no longer than this maximum duration. Serverless storage encrypts your data both in transit between your Amazon EMR Serverless application and the serverless storage layer and at rest while temporarily stored, using AWS managed encryption keys. The service uses an IAM-based security model with job-level data isolation, which means that one job cannot access the shuffle data of another job. Serverless storage maintains the same security standards as Amazon EMR Serverless, with enterprise-grade security controls throughout the processing lifecycle.

Conclusion

Serverless storage represents a fundamental shift in how we approach data processing infrastructure, eliminating manual configuration, aligning costs with actual usage, and improving reliability for I/O-intensive workloads. By offloading shuffle operations to a managed service, data engineers can focus on building analytics rather than managing storage infrastructure.

To learn more about serverless storage and get started, visit the Amazon EMR Serverless documentation.


About the authors

Karthik Prabhakar

Karthik is a Data Processing Engines Architect for Amazon EMR at AWS. He specializes in distributed systems architecture and query optimization, working with customers to solve complex performance challenges in large-scale data processing workloads. His focus spans engine internals, cost optimization strategies, and architectural patterns that enable customers to run petabyte-scale analytics efficiently.

Ravi Kumar

Ravi is a Senior Product Manager, Technical, at Amazon Web Services, specializing in exabyte-scale data infrastructure and analytics platforms. He helps customers unlock insights from structured and unstructured data using open-source technologies and cloud computing. Outside of work, Ravi enjoys exploring emerging trends in data science and machine learning.

Matt Tolton

Matt is a Senior Principal Engineer at Amazon Web Services.


Neil Mukerje

Neil is a Principal Product Manager at Amazon Web Services.
