WordCount - Hadoop Wiki

WordCount Example

The WordCount example reads text files and counts how often each word occurs. The input is a set of text files and the output is also text files, each line of which contains a word and the number of times it occurred, separated by a tab.

Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.
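The map and reduce steps described above can be sketched in plain Java without any Hadoop dependency. This is only an in-memory illustration: the class name WordCountSketch and the run method are hypothetical, and the TreeMap grouping is a stand-in for the sorting and grouping that the framework performs between the map and reduce phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory sketch; not part of Hadoop.
class WordCountSketch {

    // Map step: split one line on whitespace and emit a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Reduce step: sum all counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Stand-in for the framework: group map output by word, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop");
        // Print one "word<TAB>count" line per word, mirroring the output format.
        for (Map.Entry<String, Integer> entry : run(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For the two sample lines, the sketch prints Bye 1, Goodbye 1, Hadoop 2, Hello 2 and World 2, one tab-separated pair per line.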

As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network, because the map outputs for a given word are combined locally into a single record carrying a partial count.
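The effect of the combiner can be illustrated with a small plain-Java sketch. The names CombinerSketch, mapSplit and combine are hypothetical, and the sketch assumes the whole output of one map task is combined in a single pass, which is a simplification of what the framework does. Reusing the reducer works here because summing partial counts gives the same total as summing the individual 1s.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical in-memory sketch; not part of Hadoop.
class CombinerSketch {

    // Raw map output for one input split: one (word, 1) record per token.
    static List<Map.Entry<String, Integer>> mapSplit(List<String> lines) {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                records.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
            }
        }
        return records;
    }

    // Combine step: apply the reducer's summing logic locally, so each word
    // leaves this map task as one record holding a partial count.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : records) {
            sums.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw =
                mapSplit(Arrays.asList("Hello World Bye World"));
        List<Map.Entry<String, Integer>> combined = combine(raw);
        System.out.println("records before combining: " + raw.size());
        System.out.println("records after combining: " + combined.size());
    }
}
```

For the sample line, four records are produced by the map step and three remain after combining, since the two records for World collapse into one with a count of 2.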

To run the example, use the following command syntax:
bin/hadoop jar hadoop-*-examples.jar wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>

All of the files in the input directory (called in-dir in the command line above) are read, and the counts of the words in the input are written to the output directory (called out-dir above). Both inputs and outputs are assumed to be stored in HDFS (see ImportantConcepts). If your input is not already in HDFS but is instead on a local file system, you need to copy the data into HDFS using commands like these:

bin/hadoop dfs -mkdir <hdfs-dir>
bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>

As of version 0.17.2.1, the separate mkdir step is not needed, and you only need to run a command like this:
bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>

WordCount supports the generic options; see DevelopmentCommandLineOptions.

Below is the standard wordcount example implemented in Java:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for one word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the keys and values written by the job.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] is the input directory, args[1] is the output directory.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }

}
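
The listing above passes args[0] and args[1] straight through as the input and output paths, so it does not itself handle the -m and -r options shown in the command syntax. One way such options might be parsed is sketched below in plain Java. The class WordCountArgs and its fields are hypothetical and are not part of Hadoop; in a real job the parsed values would presumably be handed to JobConf.setNumMapTasks and JobConf.setNumReduceTasks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop.
class WordCountArgs {
    int numMaps = -1;       // -1 means the option was not given
    int numReducers = -1;   // -1 means the option was not given
    String inDir;
    String outDir;

    // Parses: [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>
    static WordCountArgs parse(String[] args) {
        WordCountArgs parsed = new WordCountArgs();
        List<String> positional = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if ("-m".equals(arg) || "-r".equals(arg)) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("Missing value for " + arg);
                }
                // Consume the option's value as well as the flag itself.
                int value = Integer.parseInt(args[++i]);
                if ("-m".equals(arg)) {
                    parsed.numMaps = value;
                } else {
                    parsed.numReducers = value;
                }
            } else {
                positional.add(arg);
            }
        }
        if (positional.size() != 2) {
            throw new IllegalArgumentException(
                    "Usage: wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>");
        }
        parsed.inDir = positional.get(0);
        parsed.outDir = positional.get(1);
        return parsed;
    }

    public static void main(String[] args) {
        WordCountArgs parsed = parse(new String[] {"-m", "4", "-r", "2", "in", "out"});
        System.out.println("maps=" + parsed.numMaps + " reducers=" + parsed.numReducers
                + " in=" + parsed.inDir + " out=" + parsed.outDir);
    }
}
```

Keeping -1 as the "not given" marker lets the caller leave the framework defaults untouched when an option is absent.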