(gawk) Cut Program

 Cutting Out Fields and Columns
 ------------------------------
 
    The `cut' utility selects, or "cuts," either characters or fields
 from its standard input and sends them to its standard output.  The
 selection is given as a list of character positions or a list of field
 numbers.  By default, fields are separated by tabs, but you may supply
 a command line option to change the field "delimiter", i.e., the field
 separator character.  `cut''s definition of fields is less general than
 `awk''s: the delimiter is always a single character.
 
    A common use of `cut' might be to pull out just the login names of
 logged-on users from the output of `who'.  For example, the following
 pipeline generates a sorted, unique list of the logged-on users:
 
      who | cut -c1-8 | sort | uniq
 
    The options for `cut' are:
 
 `-c LIST'
      Use LIST as the list of characters to cut out.  Items within the
      list may be separated by commas, and ranges of characters can be
      separated with dashes.  The list `1-8,15,22-35' specifies
      characters one through eight, 15, and 22 through 35.
 
 `-f LIST'
      Use LIST as the list of fields to cut out.
 
 `-d DELIM'
      Use DELIM as the field separator character instead of the tab
      character.
 
 `-s'
      Suppress printing of lines that do not contain the field delimiter.
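
    As a quick illustration of these options, the commands below run
 the system's own `cut' (not the `awk' version) on made-up sample
 input.  The helper name `sample' and its data are invented for this
 example.

```shell
# Sample data is invented for illustration; `cut' here is the system utility.
sample() {
    printf 'root:x:0\nno delimiter here\n'
}

# Fields 1 and 3, using ':' as the delimiter.  Lines that lack the
# delimiter are passed through unchanged.
sample | cut -d: -f1,3

# The same, but -s drops lines that lack the delimiter.
sample | cut -d: -s -f1,3

# Characters 1 through 3, 5, and 8 through 10.
printf 'abcdefghij\n' | cut -c1-3,5,8-10
```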
 
    The `awk' implementation of `cut' uses the `getopt' library function
 (*note Processing Command Line Options: Getopt Function.), and the
 `join' library function (*note Merging an Array Into a String: Join
 Function.).
 
    The program begins with a comment describing the options and a
 `usage' function which prints out a usage message and exits.  `usage'
 is called if invalid arguments are supplied.
 
      # cut.awk --- implement cut in awk
      # Arnold Robbins, arnold@gnu.org, Public Domain
      # May 1993
      
      # Options:
      #    -f list        Cut fields
      #    -d c           Field delimiter character
      #    -c list        Cut characters
      #
      #    -s        Suppress lines without the delimiter character
      
      function usage(    e1, e2)
      {
          e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
          e2 = "usage: cut [-c list] [files...]"
          print e1 > "/dev/stderr"
          print e2 > "/dev/stderr"
          exit 1
      }
 
 The variables `e1' and `e2' are used so that the function fits nicely
 on the screen.
 
    Next comes a `BEGIN' rule that parses the command line options.  It
 sets `FS' to a single tab character, since that is `cut''s default
 field separator.  The output field separator is also set to be the same
 as the input field separator.  Then `getopt' is used to step through
 the command line options.  One or the other of the variables
 `by_fields' or `by_chars' is set to true, to indicate that processing
 should be done by fields or by characters respectively.  When cutting
 by characters, the output field separator is set to the null string.
 
      BEGIN    \
      {
          FS = "\t"    # default
          OFS = FS
          while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) {
              if (c == "f") {
                  by_fields = 1
                  fieldlist = Optarg
              } else if (c == "c") {
                  by_chars = 1
                  fieldlist = Optarg
                  OFS = ""
              } else if (c == "d") {
                  if (length(Optarg) > 1) {
                      printf("Using first character of %s" \
                      " for delimiter\n", Optarg) > "/dev/stderr"
                      Optarg = substr(Optarg, 1, 1)
                  }
                  FS = Optarg
                  OFS = FS
                  if (FS == " ")    # defeat awk semantics
                      FS = "[ ]"
              } else if (c == "s")
                  suppress++
              else
                  usage()
          }
      
          for (i = 1; i < Optind; i++)
              ARGV[i] = ""
 
    Special care is taken when the field delimiter is a space.  Using
 `" "' (a single space) for the value of `FS' is incorrect: `awk' would
 separate fields with runs of spaces, tabs and/or newlines, and we want
 them to be separated with individual spaces.  Also, note that after
 `getopt' is through, we have to clear out all the elements of `ARGV'
 from one up to, but not including, `Optind', so that `awk' will not try
 to process the command line options as file names.
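
    The difference between the two settings of `FS' can be seen with a
 small experiment.  This is only a sketch; the helper name
 `count_fields' and the sample line are made up for the example.

```shell
# Count the fields in a line that contains two consecutive spaces.
count_fields() {    # $1 is the value to assign to FS
    printf 'a  b\n' | awk -v fs="$1" 'BEGIN { FS = fs } { print NF }'
}

count_fields ' '      # default splitting: a run of blanks is one separator
count_fields '[ ]'    # every single space separates: "a", "", "b"
```

 With `" "' the line has two fields; with `"[ ]"' the empty string
 between the two spaces counts as a field of its own, giving three.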
 
    After dealing with the command line options, the program verifies
 that the options make sense.  Only one or the other of `-c' and `-f'
 should be used, and both require a field list.  Then either
 `set_fieldlist' or `set_charlist' is called to pull apart the list of
 fields or characters.
 
          if (by_fields && by_chars)
              usage()
      
          if (by_fields == 0 && by_chars == 0)
              by_fields = 1    # default
      
          if (fieldlist == "") {
              print "cut: needs list for -c or -f" > "/dev/stderr"
              exit 1
          }
      
          if (by_fields)
              set_fieldlist()
          else
              set_charlist()
      }
 
    Here is `set_fieldlist'.  It first splits the field list apart at
 the commas, into an array.  Then, for each element of the array, it
 looks to see if it is actually a range, and if so splits it apart.  The
 range is verified to make sure the first number is smaller than the
 second.  Each number in the list is added to the `flist' array, which
 simply lists the fields that will be printed.  Normal field splitting
 is used; that is, the program lets `awk' do the work of splitting each
 record into fields.
 
      function set_fieldlist(        n, m, i, j, k, f, g)
      {
          n = split(fieldlist, f, ",")
          j = 1    # index in flist
          for (i = 1; i <= n; i++) {
              if (index(f[i], "-") != 0) { # a range
                  m = split(f[i], g, "-")
                  if (m != 2 || g[1] >= g[2]) {
                      printf("bad field list: %s\n",
                                        f[i]) > "/dev/stderr"
                      exit 1
                  }
                  for (k = g[1]; k <= g[2]; k++)
                      flist[j++] = k
              } else
                  flist[j++] = f[i]
          }
          nfields = j - 1
      }
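
    The range expansion can be tried on its own.  The following is a
 simplified, standalone sketch, not part of `cut.awk'; the name
 `expand_list' is made up, and the `+ 0' forcing numeric comparison is
 an extra precaution added here.

```shell
# expand_list LIST: print the numbers named by a cut-style list.
expand_list() {
    awk -v list="$1" 'BEGIN {
        n = split(list, f, ",")
        out = ""
        for (i = 1; i <= n; i++) {
            if (index(f[i], "-") != 0) {        # a range such as 3-5
                m = split(f[i], g, "-")
                # adding 0 forces a numeric comparison
                if (m != 2 || g[1] + 0 >= g[2] + 0) {
                    printf("bad field list: %s\n", f[i]) > "/dev/stderr"
                    exit 1
                }
                for (k = g[1] + 0; k <= g[2] + 0; k++)
                    out = out (out == "" ? "" : " ") k
            } else
                out = out (out == "" ? "" : " ") f[i]
        }
        print out
    }'
}

expand_list 1,3-5,9
```

 For the list `1,3-5,9' this prints `1 3 4 5 9'.  A reversed range such
 as `5-3' produces the error message and a nonzero exit status.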
 
    The `set_charlist' function is more complicated than `set_fieldlist'.
 The idea here is to use `gawk''s `FIELDWIDTHS' variable (*note Reading
 Fixed-width Data: Constant Size.), which describes constant width
 input.  When using a character list, that is exactly what we have.
 
    Setting up `FIELDWIDTHS' is more complicated than simply listing the
 fields that need to be printed.  We have to keep track of the fields to
 be printed, and also the intervening characters that have to be skipped.
 For example, suppose you wanted characters one through eight, 15, and
 22 through 35.  You would use `-c 1-8,15,22-35'.  The necessary value
 for `FIELDWIDTHS' would be `"8 6 1 6 14"'.  This gives us five fields,
 and what should be printed are `$1', `$3', and `$5'.  The intermediate
 fields are "filler," stuff in between the desired data.
 
    `flist' lists the fields to be printed, and `t' tracks the complete
 field list, including filler fields.
 
      function set_charlist(    field, i, j, f, g, n, m, t,
                                filler, last, len)
      {
          field = 1   # count total fields
          n = split(fieldlist, f, ",")
          j = 1       # index in flist
          for (i = 1; i <= n; i++) {
              if (index(f[i], "-") != 0) { # range
                  m = split(f[i], g, "-")
                  if (m != 2 || g[1] >= g[2]) {
                      printf("bad character list: %s\n",
                                     f[i]) > "/dev/stderr"
                      exit 1
                  }
                  len = g[2] - g[1] + 1
                  if (g[1] > 1)  # compute length of filler
                      filler = g[1] - last - 1
                  else
                      filler = 0
                  if (filler)
                      t[field++] = filler
                  t[field++] = len  # length of field
                  last = g[2]
                  flist[j++] = field - 1
              } else {
                  if (f[i] > 1)
                      filler = f[i] - last - 1
                  else
                      filler = 0
                  if (filler)
                      t[field++] = filler
                  t[field++] = 1
                  last = f[i]
                  flist[j++] = field - 1
              }
          }
          FIELDWIDTHS = join(t, 1, field - 1)
          nfields = j - 1
      }
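
    The worked example given earlier (`-c 1-8,15,22-35') can be checked
 with a condensed, standalone sketch of the same bookkeeping.  It only
 builds the width string and does not assign `FIELDWIDTHS', so it runs
 in any `awk'.  The name `char_widths' is made up, and the sketch
 assumes the list is in increasing order with no overlaps.

```shell
# char_widths LIST: print the width string for LIST, then the numbers
# of the fields that hold wanted characters.  No error checking.
char_widths() {
    awk -v list="$1" 'BEGIN {
        n = split(list, f, ",")
        last = 0      # last character position consumed so far
        field = 0     # total fields, filler included
        wanted = ""
        for (i = 1; i <= n; i++) {
            m = split(f[i], g, "-")
            lo = g[1] + 0
            hi = (m == 2) ? g[2] + 0 : lo    # single position: hi == lo
            if (lo - last - 1 > 0)           # skipped characters
                t[++field] = lo - last - 1
            t[++field] = hi - lo + 1         # wanted characters
            wanted = wanted (wanted == "" ? "" : " ") field
            last = hi
        }
        widths = ""
        for (i = 1; i <= field; i++)
            widths = widths (i > 1 ? " " : "") t[i]
        print widths
        print wanted
    }'
}

char_widths 1-8,15,22-35
```

 The first output line is `8 6 1 6 14' and the second is `1 3 5',
 matching the values worked out by hand above.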
 
    Here is the rule that actually processes the data.  If the `-s'
 option was given, then `suppress' will be true.  The first `if'
 statement makes sure that the input record does have the field
 separator.  If `cut' is processing fields, `suppress' is true, and the
 field separator character is not in the record, then the record is
 skipped.
 
    If the record is valid, then at this point, `gawk' has split the data
 into fields, either using the character in `FS' or using fixed-length
 fields and `FIELDWIDTHS'.  The loop goes through the list of fields
 that should be printed.  If the corresponding field has data in it, it
 is printed.  If the next field also has data, then the separator
 character is written out in between the fields.
 
      {
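          # OFS holds the literal delimiter; FS used as a regexp would
          # misfire for characters such as "." or "|"
          if (by_fields && suppress && index($0, OFS) == 0)
              next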
      
          for (i = 1; i <= nfields; i++) {
              if ($flist[i] != "") {
                  printf "%s", $flist[i]
                  if (i < nfields && $flist[i+1] != "")
                      printf "%s", OFS
              }
          }
          print ""
      }
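
    The printing loop can be exercised in isolation.  In this sketch the
 field list (fields 1 and 3) and the `:' delimiter are hard-coded
 instead of coming from the command line; the name `pick_fields' and
 the sample input are made up.

```shell
# pick_fields: print fields 1 and 3 of ':'-separated input, joined
# with ':'.  Empty fields are skipped, as in the rule above.
pick_fields() {
    awk 'BEGIN { FS = OFS = ":"; nfields = split("1 3", flist, " ") }
    {
        for (i = 1; i <= nfields; i++) {
            if ($flist[i] != "") {
                printf "%s", $flist[i]
                if (i < nfields && $flist[i+1] != "")
                    printf "%s", OFS
            }
        }
        print ""
    }'
}

printf 'a:b:c\nx:y\n' | pick_fields
```

 The first input line yields `a:c'.  The second has no third field, so
 only `x' is printed, with no trailing separator.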
 
    This version of `cut' relies on `gawk''s `FIELDWIDTHS' variable to
 do the character-based cutting.  While it would be possible in other
 `awk' implementations to use `substr' (*note Built-in Functions for
 String Manipulation: String Functions.), it would also be extremely
 painful to do so.  The `FIELDWIDTHS' variable supplies an elegant
 solution to the problem of picking the input line apart by characters.
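
    For comparison, here is a rough sketch of what the `substr'
 approach might look like.  It was written for this illustration and is
 not taken from the program; it leaves out option handling and all
 error checking, and the name `cut_chars' is made up.

```shell
# cut_chars LIST: print the characters of each input line named by
# LIST, using only substr().  Illustrative only.
cut_chars() {
    awk -v list="$1" 'BEGIN { n = split(list, f, ",") }
    {
        out = ""
        for (i = 1; i <= n; i++) {
            m = split(f[i], g, "-")
            lo = g[1] + 0
            hi = (m == 2) ? g[2] + 0 : lo    # single position: hi == lo
            out = out substr($0, lo, hi - lo + 1)
        }
        print out
    }'
}

printf 'abcdefghij\n' | cut_chars 1-3,5,8-10
```

 For the sample line this prints `abcehij'.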
 