WordCount Example
The WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.
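The logic the map and reduce phases compute together can be sketched in plain Java, without the Hadoop classes (the class name `WordCountSketch` and method `count` are illustrative, not part of the example's API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountSketch {
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // "map" phase: break each line into words, treat each as (word, 1)
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken();
                // "reduce" phase: sum the 1s emitted for each word
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(new String[] {"Hello World", "Hello Hadoop"});
        System.out.println(counts.get("Hello")); // prints 2
        System.out.println(counts.get("World")); // prints 1
    }
}
```

In the real job these two phases run on different machines, with the framework grouping all the (word, 1) pairs by key between them.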
As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining each word's records on the map side into a single record per word.
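The effect of the combiner can be illustrated with a small standalone sketch (the class name `CombinerSketch` is hypothetical): a mapper that saw the same word three times would otherwise ship three (word, 1) records to the reducers, but the combiner collapses them into one partial sum first.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Collapse a single mapper's (word, count) pairs into one pair per word,
    // mimicking what running the reducer as a combiner does before the shuffle.
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> out = Arrays.asList(
            Map.entry("the", 1), Map.entry("quick", 1), Map.entry("the", 1));
        // Three records shrink to two before crossing the network
        System.out.println(combine(out)); // prints {the=2, quick=1}
    }
}
```

Because addition is associative and commutative, summing partial sums in the reducer gives the same result as summing the raw 1s, which is why the same class can serve as both combiner and reducer here.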
To run the example, the command syntax is:
bin/hadoop jar hadoop-*-examples.jar wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>
All of the files in the input directory (called in-dir in the
command line above) are read and the counts of words in the input are
written to the output directory (called out-dir above). It is assumed
that both inputs and outputs are stored in HDFS (see ImportantConcepts).
If your input is not already in HDFS, but is rather in a local file
system somewhere, you need to copy the data into HDFS using commands
like these:
bin/hadoop dfs -mkdir <hdfs-dir>
bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
As of version 0.17.2.1, the -mkdir step is no longer necessary and you only need to run a command like this:
bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
WordCount supports generic options; see DevelopmentCommandLineOptions.
Below is the standard WordCount example implemented in Java:
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      // Break the line into whitespace-separated words and emit (word, 1) for each
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      // Sum all the counts emitted for this word and output a single total
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    // The reducer doubles as a combiner to cut down shuffle traffic
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}