Run Hadoop Jobs On E-MapReduce
1. Experiment
1.1 Knowledge points
This experiment describes how to run Hadoop jobs on Alibaba Cloud E-MapReduce. E-MapReduce is a systematic solution for big data processing. It runs on Alibaba Cloud Elastic Compute Service (ECS) and is built on open-source Apache Hadoop and Apache Spark. You can analyze and process your data by using peripheral systems in the Hadoop and Spark ecosystems, such as Apache Hive, Apache Pig, and HBase. In this experiment, you run Hadoop jobs in an existing E-MapReduce cluster from the command line interface (CLI).
1.2 Experiment process
- Log on to an E-MapReduce cluster.
- Prepare the Hadoop environment.
- Run the word statistics job.
- Run the keyword query job.
- Run the job that generates a random text file of a specified size.
1.3 Cloud resources required
1.4 Prerequisites
- You understand ECS and Server Load Balancer (SLB).
- If you use your own Alibaba Cloud account instead of the account provided by this lab, note that you need to choose the same Ubuntu 16.04 operating system for your ECS instance in order to run the experiment smoothly.
2. Start the experiment environment
Click Start Lab in the upper right corner of the page to start the experiment.
After the experiment environment is successfully started, the system has deployed resources required by this experiment in the background, including the ECS instance, RDS instance, Server Load Balancer instance, and OSS bucket. An account consisting of the username and password for logging on to the Web console of Alibaba Cloud is also provided.
After the experiment environment is started and related resources are properly deployed, a countdown begins. You have two hours to perform the experimental operations. After the countdown ends, the experiment stops and related resources are released. During the experiment, pay attention to the remaining time and plan your work accordingly. Next, use the username and password provided by the system to log on to the Alibaba Cloud Web console and view the related resources:
Go to the logon page of Alibaba Cloud console.
Fill in the sub-user account and click Next.
Fill in the sub-user password and click Log on.
After you successfully log on to the console, the following page is displayed.
3. E-MapReduce service introduction
3.1 Background
To use distributed processing systems, such as Hadoop and Spark, you need to perform the following steps:
- Evaluate business features.
- Select the server type.
- Purchase servers.
- Prepare the hardware environment.
- Install the operating system.
- Deploy apps, such as Hadoop and Spark.
- Start the cluster.
- Compile apps.
- Run jobs.
- Obtain data and perform other operations.
Steps 1 to 7 are time-consuming and complex preparations, while steps 8 to 10 relate to your business logic. E-MapReduce is a solution that integrates cluster management (host selection, environment deployment, cluster building, cluster configuration, and cluster running), job configuration, job running and management, and performance monitoring.
E-MapReduce spares you from cluster building operations, such as purchase, preparation, and maintenance, so that you only need to focus on the processing logic of your apps. In addition, E-MapReduce provides flexible combinations of cluster services for you to select based on your business features. For example, if you only require daily statistics collection and simple data computing, you can run the Hadoop service in E-MapReduce. If you also require stream computing and real-time computing, you can run the Hadoop and Spark services.
3.2 E-MapReduce components
The core component of E-MapReduce is the cluster, which users work with directly. An E-MapReduce cluster consists of one or more ECS instances and provides both Hadoop and Spark services. Take Hadoop as an example: each ECS instance runs several daemon processes (for example, NameNode, DataNode, ResourceManager, and NodeManager), and these daemons together form a Hadoop cluster. The NameNode and ResourceManager run on the master node, while the DataNode and NodeManager run on the slave nodes.
The following figure shows an E-MapReduce cluster that contains one master node and three slave nodes.
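If you want to check which daemons are running on a node after you log on to it, you can use the jps tool that ships with the JDK. The commands below are only an illustrative sketch; the exact process list depends on the services installed in your cluster.
# On the master node: the NameNode and ResourceManager should appear in the list
jps
# On a slave (core) node: the DataNode and NodeManager should appear in the list
jps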
4. Run Hadoop jobs
4.1 Log on to an E-MapReduce cluster
Click E-MapReduce.
Select US (Silicon Valley). You can see that an E-MapReduce cluster is being created. Cluster creation takes time, so you may have to wait about 10 minutes. After the creation is complete, the status of the cluster is displayed as “Idle.”
Select Cluster Management and click the cluster ID to access the cluster console.
Click Cluster Overview.
The cluster consists of one master node and two core nodes and has the HDFS, YARN, Hive, Ganglia, Spark, HUE, Tez, Sqoop, Pig, ApacheDS, and Knox services installed.
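Later, after you log on to the master node (see the following steps), you can cross-check this topology from the CLI. The commands below are standard HDFS and YARN administration commands; their output naturally varies with your cluster configuration.
# List the HDFS DataNodes registered with the NameNode
hdfs dfsadmin -report
# List the YARN NodeManagers (compute nodes) known to the ResourceManager
yarn node -list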
Click Master ECS ID.
Click Add Rules.
Copy the public IP address of the master node shown in the preceding figure and remotely log on to the master node. For more information about how to log on, see the logon instructions.
The default account name and password of the ECS instance:
Account name: root
Password: Aliyun123!@#
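A typical way to log on from a local terminal is SSH, assuming you use the public IP address copied above (replace <master-public-IP> with the actual address):
# Log on to the master node as root; enter the password when prompted
ssh root@<master-public-IP>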
4.2 Prepare the Hadoop environment
After logging on to the master node, run the following command to download the JAR package used to run Hadoop jobs in this experiment. The package contains example MapReduce programs, including a word count program that records the frequency of each word in all files stored in an input directory.
wget https://labex-ali-data.oss-us-west-1.aliyuncs.com/emr/hadoop-mapreduce-examples-2.7.2.jar
Run the following command to download a story file for statistics:
wget https://labex-ali-data.oss-us-west-1.aliyuncs.com/emr/fiction.txt
Run the following command to view the directory structure of the Hadoop file system:
hadoop fs -ls /
Run the following commands to create an input directory in the Hadoop file system:
hadoop fs -mkdir /input
hadoop fs -ls /
Run the following commands to create an output directory in the Hadoop file system:
hadoop fs -mkdir /output
hadoop fs -ls /
Run the following commands to upload the downloaded story file to the Hadoop file system:
hadoop fs -put fiction.txt /input/
hadoop fs -ls /input
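Optionally, you can verify that the upload succeeded by printing the first lines of the file directly from HDFS. This is only a sanity check and is not required for the job:
# Print the first few lines of the uploaded file from HDFS
hadoop fs -cat /input/fiction.txt | head -n 5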
4.3 Run the word statistics job
Run the following command to run the word statistics job:
hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /input/ /output/res
After the job is run, the statistics results are stored in the /output/res directory of the Hadoop file system.
Run the following command to view content in the /output/res directory:
hadoop fs -ls /output/res
The word statistics results are stored in the files whose names start with “part”.
Run the following commands to download the part-r-00002 file from HDFS to the local file system of the master node:
hadoop fs -get /output/res/part-r-00002 .
ls
Run the vim part-r-00002 command to open the file. Words and their frequencies are displayed.
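If you prefer to inspect the results without downloading them, one simple approach (an optional sketch, not part of the required steps) is to concatenate all part files from HDFS and sort by the count column; each wordcount output line has the form word<TAB>count:
# Show the 20 most frequent words across all result files
hadoop fs -cat /output/res/part-* | sort -k2 -nr | head -n 20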
4.4 Run the keyword query job
Run the following command to count the occurrences of “Harry” in the story file and save the result to the /output/res1 directory of the Hadoop file system:
hadoop jar hadoop-mapreduce-examples-2.7.2.jar grep /input/ /output/res1 Harry
Run the following command to verify that the result file has been generated:
hadoop fs -ls /output/res1
Run the following command to view the result file:
hadoop fs -cat /output/res1/part-r-00000
You can view the frequency of “Harry.”
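The grep example in this JAR accepts a regular expression rather than only a literal string, so you can also count pattern matches. The following commands are an optional variation; the output directory name /output/res1b is an arbitrary choice and must not already exist:
# Count occurrences of "Harry" followed by zero or more lowercase letters
hadoop jar hadoop-mapreduce-examples-2.7.2.jar grep /input/ /output/res1b 'Harry[a-z]*'
# View the match counts
hadoop fs -cat /output/res1b/part-r-00000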
4.5 Run the job that generates a random text file of a specified size
Run the following command to generate a random text file whose size is 100 MB and save the file to the /output/res2 directory of the Hadoop file system:
hadoop jar hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000 /output/res2
Run the following command to view the result file:
hadoop fs -ls /output/res2
Run the following command to download the result file from HDFS to the local file system of the master node:
hadoop fs -get /output/res2/part-m-00000 .
ls
Run the vim part-m-00000 command to open the file.
You can view the file content, most of which is meaningless words.
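To confirm that the generated data is roughly the requested size without downloading it, you can check the directory size in HDFS. This is an optional verification step:
# Show the size of the generated data in human-readable form
hadoop fs -du -h /output/res2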
Reminder:
Before you leave this lab, remember to log out of your Alibaba RAM account before you click the Stop button of your lab. Otherwise, you may encounter issues when opening a new lab session in the same browser.
5. Experiment summary
This experiment describes how to run Hadoop jobs in an E-MapReduce cluster. Compared with a self-built cluster, an E-MapReduce cluster is more economical and easier to use. It not only reduces cluster maintenance costs but also spares you from cluster building operations, such as purchase, preparation, and O&M. The E-MapReduce cluster enables you to focus on program design related to the business logic and greatly improves your work efficiency.