Using Apache Hive With Amazon Elastic MapReduce - 1 of 2

Uploaded by AmazonWebServices on 01.10.2009

Hello. My name is Andrew Hitchcock and I'm an engineer on the Amazon Elastic MapReduce team.
Today I'm going to demonstrate how to run Apache Hive on Elastic MapReduce.
Elastic MapReduce is a service which runs the Hadoop framework on Amazon EC2.
Durable storage for the cluster is provided by Amazon S3.
Hive is an open source, highly scalable data warehouse which runs on top of Hadoop.
You interact with Hive by writing SQL-like queries in a language called HiveQL.
They are compiled down to MapReduce jobs and run on your Hadoop cluster.
For the purpose of this demo, we are going to use the example of an ad serving company.
We have a fleet of ad servers in EC2 which upload their logs to S3 every five minutes.
These logs are stored in a Hive table.
We spawn an Elastic MapReduce job flow to join the two tables and store the result back
in S3.
For this demonstration, I'm going to show you how to start an interactive job flow using
the Elastic MapReduce console.
We'll read tables that are stored in S3.
I'll show you how to use a custom JSON SerDe that was written by Amazon,
and also the recover partition feature that was written by Amazon.
Then we'll join tables and insert them back into a partitioned table.
To begin, we should pull up the AWS console,
and navigate to the Amazon Elastic MapReduce tab.
From here, we can create a new job flow.
We should begin by giving it a descriptive name. This helps us identify it later when
we have multiple jobs running.
We can then select which type of job flow we'd like to run, which in this case is a
Hive program.
If we click continue, we'll be taken to the next step in starting our job flow.
Here you can see two options, Hive script or interactive Hive session.
The interactive Hive session allows you to SSH to your cluster and run your queries using
the Hive command line client.
The Hive script option allows you to run a HiveQL script stored in Amazon S3. This is
ideal for batch processes.
However, since I want to demonstrate the features of Hive, we'll create an interactive job flow.
This next page allows you to configure the EC2 instances that will be used to power your
job flow.
I'm going to start by bumping the number of instances up to 10 and selecting the m1.xlarge
instance type.
Under advanced options, you'll find the Amazon S3 log path.
If specified when starting your job flow, we'll upload all the job flow logs to your
S3 bucket.
When creating an interactive job flow, you must also specify an EC2 keypair. This is what
allows you to SSH onto the master node once your cluster starts.
The final page lets you review your job flow settings.
You can go back and change things that are wrong, but if everything looks okay, go ahead
and click create job flow.
You'll see a confirmation page if everything went okay.
Now we are back at the list of job flows. You can click on your job flow to see more
details.
If you scroll down you can see that your job flow already has one step, which sets up Hive
on the master node.
In a little while you'll see the public DNS name of your master node. It starts off blank
which means the node hasn't started yet.
You can refresh the page to see the status of your job flow as it starts.
If we look again we'll see that the master has checked in but the job flow is still in
the starting state and not yet ready for jobs.
While the job flow is starting, we can grab the job flow ID and switch to the Elastic
MapReduce command line client.
Using the client, we can list the job flow we are interested in.
This gives us details such as job flow ID, job flow state, master public DNS name, job
flow name, and the status of the various steps.
I should note, I configured my client before the demo.
If this is your first time using the client, you'll need to create a credentials.json file
and include your AWS access key and secret key.
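
As a rough sketch of that setup step, the Python snippet below writes a credentials.json file.
The field names access_id and private_key are assumptions that are not stated in this demo,
so check the client's own documentation before relying on them. The helper names
build_credentials and write_credentials are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


def build_credentials(access_key, secret_key):
    """Return the credentials mapping. The key names are assumptions."""
    return {
        "access_id": access_key,    # assumed field name for the AWS access key
        "private_key": secret_key,  # assumed field name for the AWS secret key
    }


def write_credentials(path, access_key, secret_key):
    """Write the credentials as JSON and restrict the file mode to the owner."""
    creds = build_credentials(access_key, secret_key)
    with open(path, "w") as handle:
        json.dump(creds, handle, indent=2)
    # The file holds a secret key, so request owner-only permissions.
    os.chmod(path, 0o600)
    return creds
```

For example, write_credentials("credentials.json", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY") would
create the file in the current directory, with placeholder values to replace with your own keys.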
Now the job flow is running and the first step is being executed.
We are looking for the job flow to reach the waiting state and for the setup Hive step
to complete.
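
If you would rather script this wait than refresh by hand, a minimal polling sketch might look
like the following. It assumes a hypothetical fetch_status callable that you supply (for example,
by wrapping the command line client) and that returns the job flow state and the setup step state.
The uppercase state names are also assumptions, not values shown in this demo.

```python
import time


def wait_until_ready(fetch_status, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll until the job flow is waiting and the Hive setup step has completed.

    fetch_status is a hypothetical callable returning a tuple of
    (job_flow_state, setup_step_state). Returns the number of polls made.
    """
    for attempt in range(1, max_polls + 1):
        flow_state, step_state = fetch_status()
        # Stop early if the job flow can no longer become ready (assumed state names).
        if flow_state in ("FAILED", "TERMINATED"):
            raise RuntimeError("job flow ended in state " + flow_state)
        if flow_state == "WAITING" and step_state == "COMPLETED":
            return attempt
        sleep(poll_seconds)
    raise TimeoutError("job flow was not ready after %d polls" % max_polls)
```

Passing sleep as a parameter keeps the loop easy to try out with canned states before pointing
it at a real cluster.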