About Piyas De
We will describe here the process to run MapReduce Job in Apache Hadoop in multinode cluster. To set up Apache Hadoop in Multinode Cluster, one can read Setting up Apache Hadoop Multi – Node Cluster.
For setting up we have to configure the hadoop with the following in each machine:
- Add the following property in conf/mapred-site.xml in all the nodes
<property><name>mapred.job.tracker</name><value>master:54311</value><description>The host and port that the MapReduce job tracker runsat. If “local”, then jobs are run in-process as a single mapand reduce task.</description></property><property><name>mapred.local.dir</name><value>${hadoop.tmp.dir}/mapred/local</value></property><property><name>mapred.map.tasks</name><value>20</value></property><property><name>mapred.reduce.tasks</name><value>2</value></property> N.B. The last three are additional setting, so we can omit them.
- The Gutenberg Project
For our demo purpose of MapReduce we will be using the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
Download the example inputs from the following sites, and all e-texts should be in plain text us-ascii encoding.
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
- The Art of War by 6th cent. B.C. Sunzi
- The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
- The Devil’s Dictionary by Ambrose Bierce
- Encyclopaedia Britannica, 11th Edition, Volume 4, Part 3
Please google for those texts. Download each ebook as text files in Plain Text UTF-8 encoding and store the files in a local temporary directory of choice, for example /tmp/gutenberg. Check the files with the following command:
$ ls -l /tmp/gutenberg/
- Next we start the dfs and mapred layer in our cluster
$ start-dfs.sh$ start-mapred.sh
Check by issuing the following command jps to check the datanodes, namenodes, and task tracker, job tracker is all up and running fine in all the nodes.
- Next we will copy the local files(here the text files) to Hadoop HDFS
$ hadoop dfs -copyFromLocal /tmp/gutenberg /Users/hduser/gutenberg$ hadoop dfs -ls /Users/hduser
If the files are successfully copied we will see something like below – Found 2 items
drwxr-xr-x – hduser supergroup 0 2013-05-21 14:48 /Users/hduser/gutenberg
Furthermore we check our file system whats in /Users/hduser/gutenberg:
$ hadoop dfs -ls /Users/hduser/gutenberg
Found 7 items-rw-r--r-- 2 hduser supergroup 336705 2013-05-21 14:48 /Users/hduser/gutenberg/pg132.txt-rw-r--r-- 2 hduser supergroup 581877 2013-05-21 14:48 /Users/hduser/gutenberg/pg1661.txt-rw-r--r-- 2 hduser supergroup 1916261 2013-05-21 14:48 /Users/hduser/gutenberg/pg19699.txt-rw-r--r-- 2 hduser supergroup 674570 2013-05-21 14:48 /Users/hduser/gutenberg/pg20417.txt-rw-r--r-- 2 hduser supergroup 1540091 2013-05-21 14:48 /Users/hduser/gutenberg/pg4300.txt-rw-r--r-- 2 hduser supergroup 447582 2013-05-21 14:48 /Users/hduser/gutenberg/pg5000.txt-rw-r--r-- 2 hduser supergroup 384408 2013-05-21 14:48 /Users/hduser/gutenberg/pg972.txt
- We start our MapReduce Job
Let us run the MapReduce WordCount example:
$ hadoop jar hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output
N.B.: Assuming that you are already in the HADOOP_HOME dir. If not then,
$ hadoop jar ABSOLUTE/PATH/TO/HADOOP/DIR/hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output
Or if you have installed the Hadoop in /usr/local/hadoop then,
hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar wordcount /Users/hduser/gutenberg /Users/hduser/gutenberg-output
The output is follows something like:
13/05/22 13:12:13 INFO mapred.JobClient: map 0% reduce 0%13/05/22 13:12:59 INFO mapred.JobClient: map 28% reduce 0%13/05/22 13:13:05 INFO mapred.JobClient: map 57% reduce 0%13/05/22 13:13:11 INFO mapred.JobClient: map 71% reduce 0%13/05/22 13:13:20 INFO mapred.JobClient: map 85% reduce 0%13/05/22 13:13:26 INFO mapred.JobClient: map 100% reduce 0%13/05/22 13:13:43 INFO mapred.JobClient: map 100% reduce 50%13/05/22 13:13:55 INFO mapred.JobClient: map 100% reduce 100%13/05/22 13:13:59 INFO mapred.JobClient: map 85% reduce 100%13/05/22 13:14:02 INFO mapred.JobClient: map 100% reduce 100%13/05/22 13:14:07 INFO mapred.JobClient: Job complete: job_201305211616_001113/05/22 13:14:07 INFO mapred.JobClient: Counters: 2613/05/22 13:14:07 INFO mapred.JobClient: Job Counters 13/05/22 13:14:07 INFO mapred.JobClient: Launched reduce tasks=313/05/22 13:14:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=11892013/05/22 13:14:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=013/05/22 13:14:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=013/05/22 13:14:07 INFO mapred.JobClient: Launched map tasks=1013/05/22 13:14:07 INFO mapred.JobClient: Data-local map tasks=1013/05/22 13:14:07 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=5462013/05/22 13:14:07 INFO mapred.JobClient: File Output Format Counters 13/05/22 13:14:07 INFO mapred.JobClient: Bytes Written=126728713/05/22 13:14:07 INFO mapred.JobClient: FileSystemCounters13/05/22 13:14:07 INFO mapred.JobClient: FILE_BYTES_READ=415112313/05/22 13:14:07 INFO mapred.JobClient: HDFS_BYTES_READ=588232013/05/22 13:14:07 INFO mapred.JobClient: FILE_BYTES_WRITTEN=693708413/05/22 13:14:07 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=126728713/05/22 13:14:07 INFO mapred.JobClient: File Input Format Counters 13/05/22 13:14:07 INFO mapred.JobClient: Bytes Read=588149413/05/22 13:14:07 INFO mapred.JobClient: Map-Reduce Framework13/05/22 13:14:07 INFO mapred.JobClient: Reduce input groups=11490113/05/22 13:14:07 INFO mapred.JobClient: Map output materialized bytes=259763013/05/22 13:14:07 INFO mapred.JobClient: Combine output records=17879513/05/22 13:14:07 INFO mapred.JobClient: Map input records=11525113/05/22 13:14:07 INFO mapred.JobClient: Reduce shuffle bytes=185712313/05/22 13:14:07 INFO mapred.JobClient: Reduce output records=11490113/05/22 13:14:07 INFO mapred.JobClient: Spilled Records=46342713/05/22 13:14:07 INFO mapred.JobClient: Map output bytes=982118013/05/22 13:14:07 INFO mapred.JobClient: Total committed heap usage (bytes)=156751462413/05/22 13:14:07 INFO mapred.JobClient: Combine input records=100555413/05/22 13:14:07 INFO mapred.JobClient: Map output records=100555413/05/22 13:14:07 INFO mapred.JobClient: SPLIT_RAW_BYTES=82613/05/22 13:14:07 INFO mapred.JobClient: Reduce input records=178795
- Retrieving the Job Result
To read directly from hadoop without copying to local file system:
$ hadoop dfs -cat /Users/hduser/gutenberg-output/part-r-00000
Let us copy the the results to the local file system though.
$ mkdir /tmp/gutenberg-output$ bin/hadoop dfs -getmerge /Users/hduser/gutenberg-output /tmp/gutenberg-output$ head /tmp/gutenberg-output/gutenberg-output
We will get a output as:
"'Ample.' 1"'Arthur!' 1"'As 1"'Because 1"'But,' 1"'Certainly,' 1"'Come, 1"'DEAR 1"'Dear 2"'Dearest 1"'Don't 1"'Fritz! 1"'From 1"'Have 1"'Here 1"'How 2
The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
Resources:
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
- http://hadoop.apache.org/docs/current/
Source : http://www.javacodegeeks.com/2013/06/running-map-reduce-job-in-apache-hadoop-multinode-cluster.html