(gawk.info) Word Sorting

 Generating Word Usage Counts
 ----------------------------
 
    The following `awk' program prints the number of occurrences of each
 word in its input.  It illustrates the associative nature of `awk'
 arrays by using strings as subscripts.  It also demonstrates the `for X
 in ARRAY' construction.  Finally, it shows how `awk' can be used in
 conjunction with other utility programs to do a useful task of some
 complexity with a minimum of effort.  Some explanations follow the
 program listing.
 
      awk '
      # Print list of word frequencies
      {
          for (i = 1; i <= NF; i++)
              freq[$i]++
      }
      
      END {
          for (word in freq)
              printf "%s\t%d\n", word, freq[word]
      }'
 
    The first thing to notice about this program is that it has two
 rules.  The first rule, because it has an empty pattern, is executed on
 every line of the input.  It uses `awk''s field-accessing mechanism
 (see Examining Fields) to pick out the individual words from
 the line, and the built-in variable `NF' (see Built-in Variables)
 to know how many fields are available.
 
    For each input word, an element of the array `freq' is incremented to
 reflect that the word has been seen an additional time.
 
    The second rule, because it has the pattern `END', is not executed
 until the input has been exhausted.  It prints out the contents of the
 `freq' table that has been built up inside the first action.
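
    As a quick check of the behavior just described, the following
 sketch feeds two short lines to the same program.  The sample input and
 the shell variable `out' are illustrative assumptions, and the trailing
 `sort' is there only because the order in which `for (word in freq)'
 visits the array is unspecified.

```shell
# Count words in a small, made-up sample; sort only to get a stable order.
out=$(printf 'the cat saw the dog\nthe dog ran\n' | awk '
{
    for (i = 1; i <= NF; i++)
        freq[$i]++          # a string subscript creates the element on first use
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}' | LC_ALL=C sort)
printf '%s\n' "$out"
```

    With this input, `the' is counted three times and `dog' twice;
 every other word appears once.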
 
    This program has several problems that would prevent it from being
 useful by itself on real text files:
 
    * Words are detected using the `awk' convention that fields are
      separated by whitespace and that other characters in the input
      (except newlines) don't have any special meaning to `awk'.  This
      means that punctuation characters count as part of words.
 
    * The `awk' language considers upper- and lower-case characters to be
      distinct.  Therefore, `bartender' and `Bartender' are not treated
      as the same word.  This is undesirable since, in normal text, words
      are capitalized if they begin sentences, and a frequency analyzer
      should not be sensitive to capitalization.
 
    * The output does not come out in any useful order.  You're more
      likely to be interested in which words occur most frequently, or
      in having an alphabetized table of how frequently each word occurs.
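
    The first two problems are easy to reproduce.  In this sketch the
 input line is made up and `out' is an assumed variable name; the
 original program is run unchanged on a line that contains one word
 written three ways.

```shell
# Made-up input: one word, but three distinct strings as far as awk is concerned.
out=$(printf 'Bartender, bartender. bartender\n' | awk '
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}')
printf '%s\n' "$out"
```

    The result has three entries, each with a count of one, because
 `Bartender,', `bartender.' and `bartender' are different subscripts.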
 
    The way to solve these problems is to use some of the more advanced
 features of the `awk' language.  First, we use `tolower' to remove case
 distinctions.  Next, we use `gsub' to remove punctuation characters.
 Finally, we use the system `sort' utility to process the output of the
 `awk' script.  Here is the new version of the program:
 
      # Print list of word frequencies
      {
          $0 = tolower($0)    # remove case distinctions
          gsub(/[^a-z0-9_ \t]/, "", $0)  # remove punctuation
          for (i = 1; i <= NF; i++)
              freq[$i]++
      }
      
      END {
          for (word in freq)
              printf "%s\t%d\n", word, freq[word]
      }
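
    To see what the two new statements do to a record, this sketch
 prints the field count and the rewritten line instead of counting
 words.  The input line and the variable `out' are assumptions made for
 the example.

```shell
# Show one made-up line after case folding and punctuation removal.
out=$(printf 'Bartender, bartender. BARTENDER!\n' | awk '
{
    $0 = tolower($0)                 # assigning to $0 re-splits the fields
    gsub(/[^a-z0-9_ \t]/, "", $0)    # so does changing $0 with gsub
    print NF ": " $0
}')
printf '%s\n' "$out"
```

    The record ends up as three identical fields, so the counting loop
 that follows sees the same word three times.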
 
    Assuming we have saved this program in a file named `wordfreq.awk',
 and that the data is in `file1', the following pipeline
 
      awk -f wordfreq.awk file1 | sort -k 2nr
 
 produces a table of the words appearing in `file1' in order of
 decreasing frequency.
 
    The `awk' program suitably massages the data and produces a word
 frequency table, which is not ordered.
 
    The `awk' script's output is then sorted by the `sort' utility and
 printed on the terminal.  The options given to `sort' in this example
 tell it to sort on the second field of each input line (skipping the
 first field), to treat the sort keys as numeric quantities (otherwise
 `15' would come before `5'), and to sort in descending (reverse)
 order.
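
    The effect of the numeric option can be checked on a tiny table.
 This sketch uses the POSIX key syntax `-k 2nr', which selects the same
 second field, compared numerically in reverse, as the options described
 above.  The table contents and the variable names `table', `numeric'
 and `textual' are assumptions for the example.

```shell
# Made-up frequency table in the same word<TAB>count format.
table=$(printf 'a\t5\nb\t15\nc\t2\n')

# Numeric, descending on field 2: 15, then 5, then 2.
numeric=$(printf '%s\n' "$table" | sort -k 2nr)

# Plain text comparison on field 2: "15" sorts before "2" and "5".
textual=$(printf '%s\n' "$table" | LC_ALL=C sort -k 2)

printf '%s\n' "$numeric"
printf '%s\n' "$textual"
```

    The first listing is ordered `b', `a', `c'; the second is ordered
 `b', `c', `a', which is why the keys have to be treated as numbers.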
 
    We could have even done the `sort' from within the program, by
 changing the `END' action to:
 
      END {
          sort = "sort -k 2nr"
          for (word in freq)
              printf "%s\t%d\n", word, freq[word] | sort
          close(sort)
      }
 
    This way of sorting is the one to use on systems that do not have
 true pipes at the command-line level.
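
    A minimal, self-contained sketch of sorting from within the
 program follows.  It assumes a `sort' that accepts the POSIX key
 syntax `-k 2nr'; the input text and the variable `out' are made up for
 the example.

```shell
# The counting rule plus an END rule that pipes its output through sort.
out=$(printf 'a b b\nc c c\n' | awk '
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    sort = "sort -k 2nr"      # command string; it also names the pipe for close()
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)               # wait for sort to finish writing its output
}')
printf '%s\n' "$out"
```

    Every `printf' writes to the same `sort' process because the
 command string is identical each time, and `close' makes `awk' wait
 until that process has produced its sorted output.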
 
    See the general operating system documentation for more information
 on how to use the `sort' program.
 