The first MapReduce program most people write after installing Hadoop is invariably the word count program.
That's what this post shows: detailed steps for writing a word count MapReduce program in Java, using Eclipse as the IDE.
Creating and copying input file to HDFS
If you already have a file in HDFS that you want to use as input, you can skip this step.
Otherwise, the first thing to do is to create a file that will be used as input and copy it to HDFS.
Let’s say you have a file wordcount.txt with the following content.
Hello wordcount MapReduce Hadoop program. This is my first MapReduce program.
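One way to create this file, assuming the local directory /netjs/MapReduce (the local path used for the copy command below) already exists, is:

echo "Hello wordcount MapReduce Hadoop program. This is my first MapReduce program." > /netjs/MapReduce/wordcount.txt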
You want to copy this file to the /user/process directory within HDFS. If that path doesn't exist then you need to create those directories first.
hdfs dfs -mkdir -p /user/process
- Refer HDFS Commands Reference List for HDFS commands.
Then copy the file wordcount.txt to this directory.
hdfs dfs -put /netjs/MapReduce/wordcount.txt /user/process
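You can verify that the file is in place by listing the directory:

hdfs dfs -ls /user/process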
Word count MapReduce example Java program
Now you can write your word count MapReduce code. The WordCount example reads text files and counts the frequency of the words. Each mapper takes a line of the input file as input and breaks it into words, emitting a key/value pair for each word in the form (word, 1). Each reducer then sums the counts for each word and emits a single key/value pair with the word and its total; the sketch below traces this for the sample input.
- Refer How MapReduce Works in Hadoop to see in detail how data is processed as (key, value) pairs in map and reduce tasks.
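For the sample line above, the (key, value) flow through the job looks roughly like this (a simplified sketch; the framework also partitions and sorts the intermediate pairs):

Map output:    (Hello, 1), (wordcount, 1), (MapReduce, 1), (Hadoop, 1), (program., 1), (This, 1), (is, 1), (my, 1), (first, 1), (MapReduce, 1), (program., 1)
After shuffle: (Hadoop, [1]), (Hello, [1]), (MapReduce, [1, 1]), (This, [1]), (first, [1]), (is, [1]), (my, [1]), (program., [1, 1]), (wordcount, [1])
Reduce output: (Hadoop, 1), (Hello, 1), (MapReduce, 2), (This, 1), (first, 1), (is, 1), (my, 1), (program., 2), (wordcount, 1)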
In the word count MapReduce code there is a Mapper class (MyMapper) with a map function and a Reducer class (MyReducer) with a reduce function. Note the package declaration org.netjs, which matches the fully qualified class name used in the run command later in this post.
package org.netjs;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map function
  public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Splitting the line on spaces
      String[] stringArr = value.toString().split("\\s+");
      for (String str : stringArr) {
        word.set(str);
        // Emit (word, 1) for every word in the line
        context.write(word, new IntWritable(1));
      }
    }
  }

  // Reduce function
  public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum all the counts emitted for this word
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "WC");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Required jars for Hadoop MapReduce code
You will also need to add at least the following Hadoop jars to your build path so that your code compiles. You will find these jars inside the /share/hadoop directory of your Hadoop installation. Within the /share/hadoop path, look in the hdfs, mapreduce and common directories for the required jars.
hadoop-common-2.9.0.jar
hadoop-hdfs-2.9.0.jar
hadoop-hdfs-client-2.9.0.jar
hadoop-mapreduce-client-core-2.9.0.jar
hadoop-mapreduce-client-common-2.9.0.jar
hadoop-mapreduce-client-jobclient-2.9.0.jar
hadoop-mapreduce-client-hs-2.9.0.jar
hadoop-mapreduce-client-app-2.9.0.jar
commons-io-2.4.jar
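If you want to compile outside Eclipse, one option is to let Hadoop supply the classpath for you rather than listing the jars by hand (a sketch, assuming the hadoop command is on your PATH; wordcount_classes is just an output directory name chosen for this example):

mkdir wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java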
Creating jar of your wordcount MapReduce code
Once you are able to compile your code you need to create a jar file. In the Eclipse IDE, right-click on your Java program and select Export > Java > JAR file.
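If you compiled from the command line instead, the jar tool does the same job (again assuming the hypothetical wordcount_classes directory from the compile sketch above):

jar cf wordcount.jar -C wordcount_classes .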
Running the MapReduce code
You can use the following command to run the program, assuming you are in your Hadoop installation directory.
bin/hadoop jar /netjs/MapReduce/wordcount.jar org.netjs.WordCount /user/process /user/out
Explanation for the arguments passed is as follows-
- /netjs/MapReduce/wordcount.jar is the path to your jar file.
- org.netjs.WordCount is the fully qualified name of your main class.
- /user/process – path to the input directory.
- /user/out – path to the output directory.
Once your word count MapReduce program has executed successfully you can verify the output file.
hdfs dfs -ls /user/out

Found 2 items
-rw-r--r--   1 netjs supergroup          0 2018-02-27 13:37 /user/out/_SUCCESS
-rw-r--r--   1 netjs supergroup         77 2018-02-27 13:37 /user/out/part-r-00000
As you can see, the Hadoop framework creates output files using the part-r-xxxxx naming format. Since only one reducer is used here, there is only one output file, part-r-00000. You can see the content of the file using the following command.
hdfs dfs -cat /user/out/part-r-00000

Hadoop      1
Hello       1
MapReduce   2
This        1
first       1
is          1
my          1
program.    2
wordcount   1
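One thing to keep in mind when rerunning the job: FileOutputFormat fails the job if the output directory already exists, so delete it first:

hdfs dfs -rm -r /user/out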
That's all for this topic Word Count MapReduce Program in Hadoop. If you have any doubts or any suggestions to make, please drop a comment. Thanks!