EMR stands for Elastic MapReduce. Amazon EMR is the AWS service for running managed Hadoop clusters. An EMR Hadoop cluster runs on virtual servers, that is, Amazon EC2 instances. Amazon has made several enhancements to the Hadoop version installed on EMR so that it works seamlessly with other AWS services such as S3, Amazon Kinesis, and CloudWatch (for monitoring cluster performance).
Jobs, Tasks and Steps
In Hadoop, a job is a unit of work. Each job contains one or more tasks (e.g. map or reduce tasks), and each task can be attempted more than once. Amazon EMR adds one more unit of work: the step.
A step contains one or more Hadoop jobs. You can also track the status of steps within a cluster. For example, a cluster that processes encrypted data might contain the following steps –
Step 1 Decrypt data
Step 2 Process data
Step 3 Encrypt data
Step 4 Save data
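To make the step/job relationship concrete, here is a minimal Python sketch of the four-step pipeline above. It is only a toy model, not the EMR API: the `Step` class, the `run_in_order` helper, and all job names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """Toy model of an EMR step: a named group of one or more Hadoop jobs."""
    name: str
    jobs: List[str] = field(default_factory=list)  # hypothetical job names
    state: str = "PENDING"


def run_in_order(steps: List[Step]) -> List[str]:
    """Walk the steps sequentially and record which job ran in which step."""
    log = []
    for step in steps:
        step.state = "RUNNING"
        for job in step.jobs:
            log.append(f"{step.name}: {job}")
        step.state = "COMPLETED"
    return log


# The four steps from the text; the job names are made up for this sketch.
pipeline = [
    Step("Decrypt data", ["decrypt-job"]),
    Step("Process data", ["clean-job", "aggregate-job"]),
    Step("Encrypt data", ["encrypt-job"]),
    Step("Save data", ["save-job"]),
]

for line in run_in_order(pipeline):
    print(line)
```

Because each step carries its own state, the status of the whole pipeline can be tracked step by step rather than job by job.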
Life Cycle of EMR Cluster
Amazon EMR first provisions a Hadoop cluster. During this time, the cluster state is STARTING. Next, any user-defined bootstrap actions are run; this is the BOOTSTRAPPING phase. This is the point after which you will be charged for the provisioned cluster.
After all the bootstrap actions are completed, the cluster enters the RUNNING state. The steps listed in the job flow then run sequentially.
If you have configured the cluster as a long-running cluster by enabling the keep-alive option, the cluster goes into the WAITING state after completing all the steps; otherwise it shuts down.
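The life cycle described above can be summarised as a small Python sketch. This is a simplification under stated assumptions: the `cluster_states` function is hypothetical, and the final state is written as TERMINATED to represent the shutdown, ignoring any intermediate states the real service may pass through.

```python
from typing import List


def cluster_states(keep_alive: bool) -> List[str]:
    """Return the sequence of states a cluster moves through, per the text."""
    states = [
        "STARTING",       # EMR provisions the Hadoop cluster
        "BOOTSTRAPPING",  # user-defined bootstrap actions run
        "RUNNING",        # steps of the job flow run sequentially
    ]
    # A long-running (keep-alive) cluster waits for more work;
    # otherwise the cluster shuts down once all steps are done.
    states.append("WAITING" if keep_alive else "TERMINATED")
    return states


print(" -> ".join(cluster_states(keep_alive=True)))
print(" -> ".join(cluster_states(keep_alive=False)))
```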
Setting up EMR cluster
Assuming that you have already signed up for an AWS account, below are the steps required to set up an EMR cluster –
Create Amazon S3 buckets for cluster logs and output
Amazon EMR can use S3 for storing inputs as well.
- Open the Amazon S3 console and create a bucket.
- Create folders named output and logs.
Launch EMR Cluster
- Open the Amazon EMR console and create a cluster.
- Choose the log directory created in the previous step.
- Choose the required Hadoop distribution. To set up an Amazon EMR cluster, use the default value, Amazon.
- For this demo, choose m3.xlarge as the instance type and 3 as the number of instances (1 master and 2 core nodes).
- For demo purposes, proceed without selecting an EC2 key pair.
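For readers who prefer scripting to the console, the same choices can be written down as a request dictionary. The sketch below only builds the dictionary; it does not call AWS. The key names follow the shape of the boto3 EMR `run_job_flow` request as far as I know it, the bucket name and cluster name are hypothetical, and a real request would need further settings (for example IAM roles and a release version) that this walkthrough leaves at the console defaults.

```python
def build_cluster_request(bucket: str, keep_alive: bool = True) -> dict:
    """Assemble the demo cluster settings from the console steps above."""
    return {
        "Name": "emr-hive-demo",            # hypothetical cluster name
        "LogUri": f"s3://{bucket}/logs/",   # the logs folder created earlier
        "Instances": {
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,             # 1 master + 2 core nodes
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
            # No "Ec2KeyName" entry: the demo proceeds without a key pair.
        },
    }


# "my-emr-demo-bucket" is a placeholder; use your own bucket name.
request = build_cluster_request("my-emr-demo-bucket")
print(request["LogUri"])
```

Keeping the settings in one function makes it easy to see that the only demo-specific inputs are the bucket name and the keep-alive choice.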
At this moment your cluster will be in the PROVISIONING state. After a few minutes it will move to the BOOTSTRAPPING phase and finally to the RUNNING state.
Running the Hive Script
For demo purposes, we will load the sample data into a Hive table and query it, storing the output in S3.
For this demo we will create an external Hive table to read data from CloudFront logs, which have the following format –
2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET eabcd12345678.cloudfront.net /test-image-1.jpeg 200 - Mozilla/5.0%20(MacOS;%20U;%20Windows%20NT%205.1;%20en-US;%20rv:188.8.131.52)%20Gecko/2009040821%20IE/3.0.9
The external Hive table that we will create looks like this –
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  Date Date,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  OS String,
  Browser String,
  BrowserVersion String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
)
LOCATION 's3://us-west-2.elasticmapreduce.samples/cloudfront/data/';
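To see what the regular expression in the table definition extracts, the following Python sketch applies the same pattern to the sample log line shown earlier. It is an approximation for illustration: Python's `re` module stands in for Hive's RegexSerDe, the doubled backslashes of the Hive string are written as single backslashes in a raw string, and the `parse_line` helper is hypothetical.

```python
import re

# Column names in the same order as the capture groups / table columns.
COLUMNS = [
    "Date", "Time", "Location", "Bytes", "RequestIP", "Method", "Host",
    "Uri", "Status", "Referrer", "OS", "Browser", "BrowserVersion",
]

# Same pattern as "input.regex" above, split over lines for readability.
LOG_PATTERN = re.compile(
    r"^(?!#)([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)"
    r"\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)"
    r"\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
)


def parse_line(line: str):
    """Return a column -> value dict, or None for comments and non-matching lines."""
    match = LOG_PATTERN.match(line)
    return dict(zip(COLUMNS, match.groups())) if match else None


SAMPLE = (
    "2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET eabcd12345678.cloudfront.net "
    "/test-image-1.jpeg 200 - Mozilla/5.0%20(MacOS;%20U;%20Windows%20NT%205.1;"
    "%20en-US;%20rv:188.8.131.52)%20Gecko/2009040821%20IE/3.0.9"
)

record = parse_line(SAMPLE)
print(record["OS"], record["Browser"], record["BrowserVersion"])
```

The last three groups pull the OS, browser, and browser version out of the user-agent string, and the `(?!#)` lookahead skips header lines that begin with `#`.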
After this, the Hive script submits the HiveQL query below to get the total requests per OS for a given period. The results are written to the S3 bucket created earlier.
SELECT os, COUNT(*) count FROM cloudfront_logs WHERE date BETWEEN '2014-07-05' AND '2014-08-05' GROUP BY os;
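The logic of this query (filter by an inclusive date range, then count rows per OS) can be mirrored in a few lines of Python. This is only a sketch of the semantics, not how Hive executes it; the `requests_per_os` function is hypothetical and the rows below are made-up values for illustration, not data from the sample logs.

```python
from collections import Counter
from datetime import date


def requests_per_os(rows, start: date, end: date) -> dict:
    """Count rows per OS where the date falls in [start, end], like BETWEEN."""
    counts = Counter()
    for row_date, os_name in rows:
        if start <= row_date <= end:  # BETWEEN includes both endpoints
            counts[os_name] += 1
    return dict(counts)


# Made-up (date, os) rows, purely to exercise the function.
rows = [
    (date(2014, 7, 5), "MacOS"),
    (date(2014, 7, 20), "Linux"),
    (date(2014, 8, 5), "MacOS"),
    (date(2014, 8, 6), "Linux"),  # falls outside the period
]

result = requests_per_os(rows, date(2014, 7, 5), date(2014, 8, 5))
print(result)
```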
Adding Hive Script as Step
- Open the EMR console and click Add Step as shown in the snapshot.
- We will use the sample script and sample location provided by Amazon to run this demo, so give the script and input paths as shown in the snapshots.
- Configure the output location as the output directory created in the S3 bucket in the earlier steps.
The sample input location and the sample Hive script location start with –
[myregion] is the region of your cluster, which you can get from the hardware configuration as shown below –
Strip the trailing letter from the availability zone (in this case the 'a' in us-east-1a) to get the region.
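The rule for deriving the region from the availability zone can be written as a one-line helper. This is a minimal sketch: the `region_from_az` function is hypothetical, and it assumes the zone name is simply the region name followed by a single letter, as in the example above.

```python
def region_from_az(availability_zone: str) -> str:
    """Drop the trailing zone letter, e.g. the 'a' in 'us-east-1a'."""
    if not availability_zone or not availability_zone[-1].isalpha():
        raise ValueError(f"not an availability zone name: {availability_zone!r}")
    return availability_zone[:-1]


print(region_from_az("us-east-1a"))
```

The resulting value is what goes in place of [myregion].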
Once the step is added, it runs automatically, and the output is stored in the output directory configured in the previous step.
View the results
- Open the S3 console and go to the output directory created in the first step.
- Download the created file to check the results.
The output file looks like this –
As we can see, Amazon EMR makes life much easier: you can create a Hadoop cluster and scale it with just a few clicks.