Updated: Mar 6
AWS Marketplace Senior Analytics Specialist Solution Architect Recommends Subscribing to Windjammer EMR Accelerator
Apache Spark is a widely used enterprise standard for distributed big data processing. Spark has gained popularity due to its fast in-memory computing that allows for parallel computation of tasks across multiple nodes. To aid customers with running Spark workloads, Amazon EMR provides a managed cluster platform that makes it easy to run frameworks such as Apache Hadoop, Apache Spark, and Apache Hive.
Although Amazon EMR has made many enhancements to the Spark framework, the massive scale of use has exposed certain shortcomings with Spark. Users of Spark for data processing have likely experienced issues with Spark’s JVM being very CPU intensive, resulting in high operating costs and server sprawl, Spark not fully exploiting the bandwidth of Amazon S3, and the need for expensive shuffle operations for fault tolerance.
Windjammer is a member of the AWS Partner Network (APN) and is available in AWS Marketplace. Windjammer EMR Accelerator is a 100 percent compatible native Spark plug-in that can be implemented on top of Amazon EMR to seamlessly address these Spark bottlenecks, providing cost reduction, enhanced performance, and added fault tolerance.
In this post, Kai and I share how to implement the Windjammer Spark plug-in on your EMR cluster to achieve performance improvements and cost reduction on your Spark workloads. The Windjammer framework also features the ability to perform A/B testing to measure the impact the Accelerator has on your current workloads.
You must complete the following prerequisites before implementing the Windjammer EMR Accelerator:
Subscribe to EMR Accelerator (EMR 6.X/SPARK) via AWS Marketplace.
Choose View purchase options, then choose Subscribe, and then choose Set up your account to access the Windjammer Welcome page.
On the Welcome page, enter your email address and select Submit.
The Windjammer EMR Accelerator: Guide to Deployment, Management, and A/B Benchmarking POCs will be displayed.
Windjammer EMR Accelerator is implemented as a Spark plug-in that optimizes Spark SQL query execution in a completely compatible way. The Accelerator is installed on the EMR cluster via an EMR bootstrap action and requires no changes to the Spark application or code.
The EMR Accelerator native plug-in addresses the Spark shortcomings highlighted in the introduction by reducing query CPU utilization to ¼, increasing Amazon S3 bandwidth utilization by 3X, and utilizing Amazon S3-based checkpointing to provide efficient fault tolerance without a shuffle service. This results in eliminating server sprawl while improving performance and availability.
TPC-DS benchmarks run on EMR 6.X show that the EMR Accelerator improves Spark SQL runtime performance by a factor of 2.6 times over stock Amazon EMR Spark. This means that the Amazon EMR core and task node configurations can be reduced by a factor of 4 while still improving Spark SQL execution by 1.5 times, resulting in cost savings and performance improvement.
The Windjammer framework provides A/B benchmarking scripts for both sample datasets and your own queries to make it easy to clearly observe the impact of the EMR Accelerator on your Spark workloads. For full architecture, design, and performance benchmarking details, please see the Windjammer technical whitepaper.
Here is a step-by-step guide for creating EMR clusters with the EMR Accelerator, running A/B benchmarking, and running Spark workloads with the Accelerator.
Step 1: Create and configure an EMR cluster with EMR Accelerator
Please note that currently, the sample datasets for A/B benchmarking are available in limited Regions:
US East (N. Virginia)
US East (Ohio)
US West (Oregon)
If you would like to run the A/B benchmarking with sample data in an unavailable Region, you will need to copy the sample TPC-DS dataset from an available Region to a local bucket in your account. Data transfer costs may be incurred. Please see the Windjammer Deployment and POC Guide for more details.
Step 1.1 Select the software and steps for the cluster.
On the Amazon EMR console, select Create cluster.
Select Go to advanced options.
Under Software configuration, select an Amazon EMR 6.X release (6.2 and above are supported).
At minimum, select Hadoop 3.2.1+, Hive 3.1.2+, and Spark 3.1.2+ applications.
In the Steps section, select Custom JAR as the Step type and select Add step.
In the Add step dialog box:
Set JAR location to command-runner.jar
Set Arguments to /tmp/wjm-prep
Step 1.2 Select the hardware for the cluster.
The Amazon Elastic Compute Cloud (Amazon EC2) instance type selections below represent a balanced cluster configuration for running the EMR Accelerator framework with the TPC-DS workload at 3 TB or 10 TB scale. When selecting different EC2 instance types, please select instances that have networking performance of approximately 1 Gbps/vCPU, for example, EC2 instances with an “n” in their name). EMR Accelerator does not require instances with local SSDs. Typically, EMR clusters running the EMR Accelerator require fewer EC2 instances than stock EMR clusters while still accelerating performance.
Select Master node type: m5.xlarge.
For the Core or Task nodes, select r5dn.4xlarge.
Provision with 2 or 4 instances of either Core or Task nodes.
Configure the Root device EBS volume size to at least 50 GiB
Step 1.3 Select the general cluster settings.
You may optionally choose the available cluster settings; however, only the necessary steps will be outlined here.
Choose Bootstrap actions:
Set Add bootstrap action to Custom action.
Choose Configure and Add.
In Script location, enter s3://wjm-build-1-5/bootstrap
Optional arguments: If you would like to optionally run A/B benchmarking and your cluster is in one of the AWS Regions listed in Step 1, you do not need any other bootstrap arguments. If you are not running in one of these Regions and have copied the dataset to a bucket in your account and Region, specify the bucket name WJM_DATASETS=<local TPC-DS bucket name> in the Optional arguments dialog. Note that WJM_DATASETS must be in capital letters, and the bucket is specified without the s3:// prefix.
Step 1.4 Select the security for the cluster.
Select an EC2 key pair.
You may proceed without a key pair but will be unable connect to the nodes by using SSH.
Select Create cluster.
Step 2: Running A/B benchmarks on Spark workloads
Running TPC-DS queries following the steps below will also generate a performance report comparing EMR Accelerator to Amazon EMR Spark.
Step 2.1 Running A/B benchmarks on pre-installed TPC-DS queries
Select your cluster from the Amazon EMR console.
Select the Steps tab.
Select Add step.
In the Add step dialog:
Set JAR location to command-runner.jar
Set Arguments to:
/opt/wjm/bin/sperfjob <size of dataset in TB> <query number>* "sne emr"
For example: /opt/wjm/bin/sperfjob 3 3 "sne emr" Select Add.
The Steps tab will show the status of the executing query step.
Once the step has completed, you may select View logs and then the stdout link to see a full report comparing the two benchmarks.
* For a full list of the queries available, please see the /opt/wjm/queries directory on the cluster's master node.
You may use the Windjammer performance testing infrastructure to run custom queries that evaluate EMR Acclerator against your own workloads. To execute custom queries with A/B benchmarking, you need to copy your SQL query and query results file to the master node. Please see the Windjammer Deployment and POC Guide for more details.
Step 3: Running Spark workloads with EMR Accelerator
Once you have evaluated the EMR Accelerator and observed the impact on your Spark workloads, you can begin to run your Spark workloads with the EMR Accelerator as you would normally.
EMR Accelerator requires no changes to existing methods of submitting Spark SQL jobs, and no additional options to spark-submit invocations are necessary. Amazon EMR notebooks may be used as is, and there are no restrictions on application languages. As always, please perform sufficient testing prior to running production workloads.
After you have finished, to release resources and prevent further costs, perform the following:
Terminate the EMR cluster that you set up for use with the EMR Accelerator.
If a local copy of the TPC-DS datasets bucket (s3:// wjm-datasets-<region>) was made, empty and delete the S3 bucket.
In this post, we showed you how to use Windjammer EMR Accelerator to improve performance while reducing costs on Spark workloads. We also showed you how to perform A/B benchmarking with the Windjammer framework to observe the impact on Spark workloads.
About the authors
Kai Rothauge is Chief Technology Officer at Windjammer Technologies, where he leads Windjammer Accelerator architecture and design. Prior to Windjammer, Kai was a Postdoctoral Researcher at the University of California, Berkeley, in the RISELab, where he developed Alchemist to accelerate Spark for large-scale data analysis by offloading to high-performance computing libraries. Kai received his PhD from the University of British Columbia in Applied Mathematics and his MMath from the University of Bath. Prior to his doctoral studies, he also completed scientific internships at the Max Planck Institute, the Fraunhofer Institute, and CSIRO.
Amber Runnels is a Senior Analytics Specialist Solutions Architect at AWS specializing in big data and distributed systems. She helps customers optimize workloads in the AWS data ecosystem to achieve a scalable, performant, and cost-effective architecture. Aside from technology, she is passionate about exploring the many places and cultures this world has to offer, reading novels, and building terrariums.