BIG DATA

Quick Start for Amazon EMR by nClouds

  • Pre-built Amazon EMR stack provides a fast path to build and deploy big data analytics applications.
  • Identify Spot and Dedicated Instance discounts with an intelligent pricing option.
Contact us

Big Data frameworks

Software frameworks like Apache Hadoop can help you process large data sets by distributing the data and processing across many computers. But deploying, configuring, and managing these distributed clusters can be difficult, time-consuming, and expensive.

Amazon EMR: faster innovation, lower experimentation costs

Amazon EMR is a managed Hadoop framework that uses the elastic infrastructure of Amazon EC2 and Amazon S3 to make it easy, fast, and cost-effective to distribute the computation of your data across multiple, dynamically scalable EC2 instances.

You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

Amazon EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

EMR manages the clusters so you can focus on analyzing the data.

Figure: Amazon EMR basic architecture

Benefits of Amazon EMR

Adapted from Amazon EMR
  • Elastic

    With Amazon EMR, you can provision one, hundreds, or even thousands of compute instances to process data at any scale. You can easily increase or decrease the number of instances manually or with Auto Scaling, and you only pay for what you use.

  • Reliable

    You can spend less time tuning and monitoring your cluster. Amazon EMR has tuned Hadoop for the cloud; it also monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances.

  • Secure

    Amazon EMR automatically configures Amazon EC2 firewall settings that control network access to instances, and you can launch clusters in an Amazon Virtual Private Cloud (VPC), a logically isolated network you define. For objects stored in Amazon S3, you can use Amazon S3 server-side encryption or Amazon S3 client-side encryption with EMRFS, with AWS Key Management Service or customer-managed keys. You can also easily enable other encryption options and authentication with Kerberos.

  • Flexible

    You have complete control over your cluster. You have root access to every instance, you can easily install additional applications, and you can customize every cluster with bootstrap actions. You can also launch Amazon EMR clusters with custom Amazon Linux AMIs.

  • Easy to Use

    You can launch an Amazon EMR cluster in minutes. You don’t need to worry about node provisioning, cluster setup, Hadoop configuration, or cluster tuning. Amazon EMR takes care of these tasks so you can focus on analysis.

Quick Start for Amazon EMR by nClouds

At nClouds, we wanted to make it fast and easy to get started with Amazon EMR, so we created a Quick Start for Amazon EMR. You can get up and running quickly with all your use cases, and we’ve made it easy to use Spot and Dedicated Instance discounts.

Go faster & reduce costs:

  • Stand up clusters fast using AWS CloudFormation templates and end-to-end automation.
  • Identify Spot instance discounts to reduce costs using the intelligent pricing option added to CloudFormation.
  • Automatically shut down the cluster after scheduled use to save money.

Quick Start Demo Use Case

Provisioned resources summary

  • Launch the EMR cluster using a CloudFormation (CF) stack.
  • Create a new VPC for the EMR cluster instances.
  • Use Spot instances: this CloudFormation template uses a Spot instance bid amount of $0.100.
  • Run a Spark job that takes its input from an S3 location and writes its output to an S3 location.

Please note: This demo is not covered by the AWS Free Tier and will be billed. The EMR cluster is not terminated automatically; delete the stack manually from the CloudFormation console when you are done.

  • This is a small demo for getting started with Amazon EMR.
  • The included CloudFormation template launches an EMR stack in a new VPC and executes a small step on Apache Spark.
  • The step takes as input a CSV file in S3 that contains census data with male and female population counts.
  • The PySpark script calculates the gender ratio, adds it as a new column to the CSV data, and uploads the result to the output bucket.
  • The output bucket must be specified when creating the stack.
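
The per-row calculation described above can be sketched in a few lines. The demo's actual PySpark script is not reproduced here; the version below uses only Python's standard csv module so that it runs anywhere. The column names male, female, and gender_ratio, and the definition of the ratio as males per 100 females, are assumptions made for illustration.

```python
import csv
import io


def add_gender_ratio(csv_text, male_col="male", female_col="female",
                     ratio_col="gender_ratio"):
    """Return CSV text with an extra column: males per 100 females.

    Column names and the ratio definition are illustrative assumptions,
    not taken from the demo's actual PySpark script.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [ratio_col]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        males = float(row[male_col])
        females = float(row[female_col])
        # Leave the ratio empty when there are no females to divide by.
        row[ratio_col] = f"{100 * males / females:.2f}" if females else ""
        writer.writerow(row)
    return out.getvalue()


# Tiny made-up sample, not real census data.
sample = "region,male,female\nNorth,1050,1000\nSouth,980,1000\n"
result = add_gender_ratio(sample)
print(result)
```

In the demo itself, the input and output live in S3 and the work runs as a Spark step on the cluster; this sketch only shows the shape of the transformation.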

Quick Start Demo Steps

Launch the CF stack by clicking the launch button.

Enter the parameters specific to your AWS account and submit the CF stack.
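
If you prefer to submit the stack from a script instead of the console, the same step might look like the sketch below. It assumes a boto3 CloudFormation client is passed in; the parameter keys (KeyName, OutputBucket, SpotBidPrice) are hypothetical placeholders, so check the template for the real names. Whether the template needs the CAPABILITY_IAM acknowledgement is also an assumption.

```python
def build_parameters(values):
    """Convert a plain dict into the Parameters list CloudFormation expects."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
    ]


def launch_quick_start(cfn_client, stack_name, template_url, values):
    """Submit the stack and return its ID.

    cfn_client is a boto3 CloudFormation client,
    e.g. boto3.client("cloudformation").
    """
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=build_parameters(values),
        # Assumption: only required if the template creates IAM resources.
        Capabilities=["CAPABILITY_IAM"],
    )
    return response["StackId"]


# Hypothetical parameter keys and values; the real template may differ.
demo_values = {
    "KeyName": "my-existing-key-pair",
    "OutputBucket": "my-output-bucket",
    "SpotBidPrice": 0.100,
}
print(build_parameters(demo_values))
```

Passing the client in as an argument keeps the function easy to test with a stub and leaves region and credential handling to the caller.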

CF stack creation in progress.

CF stack creates EMR cluster.

Nodes are created in the new VPC created by the CF stack.

Stack created successfully.

EMR cluster ready to take jobs.

CF stack executes Spark Job successfully.

Output of Spark Job pushed to S3 bucket.

To access the cluster via SSH, you will need to replace the default key, nclouds-emr-demo, with a real one already set up in your AWS account.

The other key parameters to review are:

  • Cluster instance count, default: 1 master and 2 core instances
  • Instance size, default: m4.large
  • EBS volume size, default: 20 GB
  • Spot instance bid amount, default: $0.100 (review this especially if you modify the instance type)

In addition to the above, you will probably want to set up your own workload by updating the sample Spark step.
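
Because the demo cluster is not terminated automatically, cleanup is worth scripting once you start running your own workloads. The sketch below assumes a boto3 CloudFormation client is passed in and that the stack name is whatever you chose at launch; the name in the usage comment is a hypothetical placeholder.

```python
def delete_demo_stack(cfn_client, stack_name):
    """Delete the stack and block until CloudFormation reports completion.

    cfn_client is a boto3 CloudFormation client,
    e.g. boto3.client("cloudformation").
    """
    cfn_client.delete_stack(StackName=stack_name)
    # The waiter polls the stack status until deletion finishes or fails.
    waiter = cfn_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)
    return stack_name


# Usage (requires boto3 and AWS credentials; stack name is a placeholder):
#   import boto3
#   delete_demo_stack(boto3.client("cloudformation"), "emr-quick-start-demo")
```

Deleting the stack removes the resources it created, including the EMR cluster, which is what stops the charges mentioned in the provisioned resources summary.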

How can we help?

We'd love to discuss your big data project with Amazon EMR.

Contact us

Get more info on Amazon EMR

You can email us directly at sales@nclouds.com or use the form below
