(gawk) Uniq Program

Info Catalog (gawk) Tee Program (gawk) Clones (gawk) Wc Program
 
 Printing Non-duplicated Lines of Text
 -------------------------------------
 
    The `uniq' utility reads sorted lines of data on its standard input,
 and (by default) removes duplicate lines.  In other words, only unique
 lines are printed, hence the name.  `uniq' has a number of options. The
 usage is:
 
      uniq [-udc [-N]] [+N] [ INPUT FILE [ OUTPUT FILE ]]
 
    The option meanings are:
 
 `-d'
      Only print repeated lines.
 
 `-u'
      Only print non-repeated lines.
 
 `-c'
      Count lines. This option overrides `-d' and `-u'.  Both repeated
      and non-repeated lines are counted.
 
 `-N'
      Skip N fields before comparing lines.  The definition of fields is
      similar to `awk''s default: non-whitespace characters separated by
      runs of spaces and/or tabs.
 
 `+N'
      Skip N characters before comparing lines.  Any fields specified
      with `-N' are skipped first.
 
 `INPUT FILE'
      Data is read from the input file named on the command line,
      instead of from the standard input.
 
 `OUTPUT FILE'
      The generated output is sent to the named output file, instead of
      to the standard output.
 
    Normally `uniq' behaves as if both the `-d' and `-u' options had
 been provided.
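 
    As a rough illustration of that default, the following shell sketch
 pipes a small, made-up sample through three short `awk' programs, one
 per mode.  It is not part of the program presented in this section; the
 sample data and the variable names (`sample', `default_out', and so on)
 are assumptions chosen for the example.

```shell
# Made-up sample data, already sorted as uniq expects.
sample='apple
apple
banana
cherry
cherry
cherry'

# Default: one copy of every run of identical lines.
default_out=$(printf '%s\n' "$sample" | awk '
    NR == 1 || $0 != last { print }
    { last = $0 }')

# Like -d: one copy of each line that occurs more than once.
repeated_out=$(printf '%s\n' "$sample" | awk '
    NR > 1 && $0 != last { if (count > 1) print last; count = 0 }
    { last = $0; count++ }
    END { if (count > 1) print last }')

# Like -u: only lines that occur exactly once.
unique_out=$(printf '%s\n' "$sample" | awk '
    NR > 1 && $0 != last { if (count == 1) print last; count = 0 }
    { last = $0; count++ }
    END { if (count == 1) print last }')

printf 'default:\n%s\n-d:\n%s\n-u:\n%s\n' \
    "$default_out" "$repeated_out" "$unique_out"
```

    With this sample, every line of the default output shows up in
 exactly one of the other two outputs, which is what "both `-d' and
 `-u'" amounts to.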
 
    Here is an `awk' implementation of `uniq'. It uses the `getopt'
 library function (*note Processing Command Line Options: Getopt
 Function.), and the `join' library function (*note Merging an Array
 Into a String: Join Function.).
 
    The program begins with a `usage' function and then a brief outline
 of the options and their meanings in a comment.
 
    The `BEGIN' rule deals with the command line arguments and options.
 It uses a trick to get `getopt' to handle options of the form `-25',
 treating such an option as the option letter `2' with an argument of
 `5'. If indeed two or more digits were supplied (`Optarg' looks like a
 number), `Optarg' is concatenated with the option digit, and then the
 result is added to zero to make it into a number.  If there is only one
 digit in the option, then `Optarg' is not needed, and `Optind' must be
 decremented so that `getopt' will process it next time.  This code is
 admittedly a bit tricky.
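 
    The digit handling can be tried in isolation.  The sketch below
 does not call `getopt'; the hypothetical shell helper
 `parse_digit_option' simply takes the two values that `getopt' would
 hand back (the option digit and `Optarg') and applies the same test.
 The second number it prints is an assumed flag, `rewind', that stands
 in for the `Optind--' step.

```shell
# parse_digit_option DIGIT OPTARG
# Prints "fcount rewind" for one digit option, where rewind is 1 when the
# supposed argument was not part of the number and must be reprocessed.
parse_digit_option() {
    awk -v c="$1" -v optarg="$2" 'BEGIN {
        if (optarg ~ /^[0-9]+$/) {
            # e.g. -25: digit "2" plus argument "5" makes 25
            fcount = (c optarg) + 0
            rewind = 0
        } else {
            # e.g. -5 infile: "infile" was swallowed as the argument
            fcount = c + 0
            rewind = 1
        }
        print fcount, rewind
    }'
}

parse_digit_option 2 5        # option written as -25
parse_digit_option 5 infile   # option written as -5, then a file name
```

    The first call prints `25 0' and the second prints `5 1'.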
 
    If neither `-d' nor `-u' was supplied, then the default is taken, to
 print both repeated and non-repeated lines.  The output file, if
 provided, is
 assigned to `outputfile'.  Earlier, `outputfile' was initialized to the
 standard output, `/dev/stdout'.
 
      # uniq.awk --- do uniq in awk
      # Arnold Robbins, arnold@gnu.org, Public Domain
      # May 1993
      
      function usage(    e)
      {
          e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
          print e > "/dev/stderr"
          exit 1
      }
      
      # -c    count lines. overrides -d and -u
      # -d    only repeated lines
      # -u    only non-repeated lines
      # -n    skip n fields
      # +n    skip n characters, skip fields first
      
      BEGIN   \
      {
          count = 1
          outputfile = "/dev/stdout"
          opts = "udc0:1:2:3:4:5:6:7:8:9:"
          while ((c = getopt(ARGC, ARGV, opts)) != -1) {
              if (c == "u")
                  non_repeated_only++
              else if (c == "d")
                  repeated_only++
              else if (c == "c")
                  do_count++
              else if (index("0123456789", c) != 0) {
                  # getopt requires args to options
                  # this messes us up for things like -5
                  if (Optarg ~ /^[0-9]+$/)
                      fcount = (c Optarg) + 0
                  else {
                      fcount = c + 0
                      Optind--
                  }
              } else
                  usage()
          }
      
          if (ARGV[Optind] ~ /^\+[0-9]+$/) {
              charcount = substr(ARGV[Optind], 2) + 0
              Optind++
          }
      
          for (i = 1; i < Optind; i++)
              ARGV[i] = ""
      
          if (repeated_only == 0 && non_repeated_only == 0)
              repeated_only = non_repeated_only = 1
      
          if (ARGC - Optind == 2) {
              outputfile = ARGV[ARGC - 1]
              ARGV[ARGC - 1] = ""
          }
      }
 
    The following function, `are_equal', compares the current line,
 `$0', to the previous line, `last'.  It handles skipping fields and
 characters.
 
    If no field count and no character count were specified, `are_equal'
 simply returns one or zero depending upon the result of a simple string
 comparison of `last' and `$0'.  Otherwise, things get more complicated.
 
    If fields have to be skipped, each line is broken into an array using
 `split' (*note Built-in Functions for String Manipulation: String
 Functions.), and then the desired fields are joined back into a line
 using `join'.  The joined lines are stored in `clast' and `cline'.  If
 no fields are skipped, `clast' and `cline' are set to `last' and `$0'
 respectively.
 
    Finally, if characters are skipped, `substr' is used to strip off the
 leading `charcount' characters in `clast' and `cline'.  The two strings
 are then compared, and `are_equal' returns the result.
 
      function are_equal(    n, m, clast, cline, alast, aline)
      {
          if (fcount == 0 && charcount == 0)
              return (last == $0)
      
          if (fcount > 0) {
              n = split(last, alast)
              m = split($0, aline)
              clast = join(alast, fcount+1, n)
              cline = join(aline, fcount+1, m)
          } else {
              clast = last
              cline = $0
          }
          if (charcount) {
              clast = substr(clast, charcount + 1)
              cline = substr(cline, charcount + 1)
          }
      
          return (clast == cline)
      }
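 
    To see what the comparison actually looks at, the following sketch
 prints the string that would be compared for each input line.  The
 shell helper `compare_key' is hypothetical, and it rebuilds the line
 with a plain loop rather than calling the `join' library function, so
 that it runs without the library; the sample lines are made up.

```shell
# compare_key FIELDS CHARS
# For each line on standard input, print what is left after skipping
# FIELDS whitespace-separated fields and then CHARS characters.
compare_key() {
    awk -v fcount="$1" -v charcount="$2" '
    {
        key = $0
        if (fcount > 0) {
            n = split($0, f)
            # Rebuild the remaining fields, separated by single spaces.
            key = f[fcount + 1]
            for (i = fcount + 2; i <= n; i++)
                key = key " " f[i]
        }
        if (charcount > 0)
            key = substr(key, charcount + 1)
        print key
    }'
}

printf '%s\n' '10:01 host1 up' '10:07 host1 up' | compare_key 1 0
printf '%s\n' '10:01 host1 up' | compare_key 1 4
```

    The first command prints `host1 up' twice, so the two lines count
 as equal once the timestamp field is skipped.  The second prints
 `1 up'.  Note that in this sketch, skipping fields also collapses each
 run of spaces or tabs between the remaining fields into a single space.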
 
    The following two rules are the body of the program.  The first one
 is executed only for the very first line of data.  It sets `last' equal
 to `$0', so that subsequent lines of text have something to be compared
 to.
 
    The second rule does the work. The variable `equal' will be one or
 zero depending upon the results of `are_equal''s comparison. If `uniq'
 is counting lines (the `-c' option), then the `count' variable is
 incremented if the lines are equal. Otherwise the previous line is
 printed along with its count, and `count' is reset, since the two
 lines are not equal.
 
    If `uniq' is not counting, `count' is likewise incremented if the
 lines are equal. Otherwise, if `uniq' is printing repeated lines and
 more than one copy of the line has been seen, or if `uniq' is printing
 non-repeated lines and only one copy has been seen, then the previous
 line is printed, and `count' is reset.
 
    Finally, similar logic is used in the `END' rule to print the final
 line of input data.
 
      NR == 1 {
          last = $0
          next
      }
      
      {
          equal = are_equal()
      
          if (do_count) {    # overrides -d and -u
              if (equal)
                  count++
              else {
                  printf("%4d %s\n", count, last) > outputfile
                  last = $0
                  count = 1    # reset
              }
              next
          }
      
          if (equal)
              count++
          else {
              if ((repeated_only && count > 1) ||
                  (non_repeated_only && count == 1))
                      print last > outputfile
              last = $0
              count = 1
          }
      }
      
      END {
          if (do_count)
              printf("%4d %s\n", count, last) > outputfile
          else if ((repeated_only && count > 1) ||
                  (non_repeated_only && count == 1))
              print last > outputfile
      }
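 
    The decision logic of these rules can be exercised without the
 option parsing.  The sketch below is a cut-down stand-in, not the
 program above: the hypothetical shell function `mini_uniq' takes the
 mode as a single letter (`c', `d', `u', or empty for the default),
 compares whole lines only, and writes to the standard output.  The
 `NR > 0' test in its `END' rule is an addition so that empty input
 produces no output.

```shell
# mini_uniq MODE
# MODE is "c" (count), "d" (repeated only), "u" (non-repeated only),
# or "" (default: both repeated and non-repeated lines).
mini_uniq() {
    awk -v mode="$1" '
    # Print the finished run held in last, according to the mode.
    function emit() {
        if (mode == "c")
            printf("%4d %s\n", count, last)
        else if ((mode != "u" && count > 1) ||
                 (mode != "d" && count == 1))
            print last
    }
    NR == 1    { last = $0; count = 1; next }
    $0 == last { count++; next }
               { emit(); last = $0; count = 1 }
    END        { if (NR > 0) emit() }'
}

printf 'a\na\nb\n' | mini_uniq c
printf 'a\na\nb\n' | mini_uniq d
printf 'a\na\nb\n' | mini_uniq u
```

    For the three-line input shown, the `c' run prints `   2 a' and
 `   1 b', the `d' run prints `a', and the `u' run prints `b'.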
 