(gawk) Records

 How Input is Split into Records
 ===============================
 
    The `awk' utility divides the input for your `awk' program into
 records and fields.  Records are separated by a character called the
 "record separator".  By default, the record separator is the newline
 character.  This is why records are, by default, single lines.  You can
 use a different character for the record separator by assigning the
 character to the built-in variable `RS'.
 
    You can change the value of `RS' in the `awk' program, like any
 other variable, with the assignment operator, `=' (see Assignment
 Expressions).  The new record-separator character should be enclosed
 in quotation marks, which indicate a string constant.  Often the right
 time to do this is at the beginning of execution, before any input has
 been processed, so that the very first record is read with the proper
 separator.  To do this, use the special `BEGIN' pattern (see The
 `BEGIN' and `END' Special Patterns).  For example:
 
      awk 'BEGIN { RS = "/" } ; { print $0 }' BBS-list
 
 changes the value of `RS' to `"/"', before reading any input.  This is
 a string whose first character is a slash; as a result, records are
 separated by slashes.  Then the input file is read, and the second rule
 in the `awk' program (the action with no pattern) prints each record.
 Since each `print' statement adds a newline at the end of its output,
 the effect of this `awk' program is to copy the input with each slash
 changed to a newline.  Here are the results of running the program on
 `BBS-list':
 
      $ awk 'BEGIN { RS = "/" } ; { print $0 }' BBS-list
      -| aardvark     555-5553     1200
      -| 300          B
      -| alpo-net     555-3412     2400
      -| 1200
      -| 300     A
      -| barfly       555-7685     1200
      -| 300          A
      -| bites        555-1675     2400
      -| 1200
      -| 300     A
      -| camelot      555-0542     300               C
      -| core         555-2912     1200
      -| 300          C
      -| fooey        555-1234     2400
      -| 1200
      -| 300     B
      -| foot         555-6699     1200
      -| 300          B
      -| macfoo       555-6480     1200
      -| 300          A
      -| sdace        555-3430     2400
      -| 1200
      -| 300     A
      -| sabafoo      555-2127     1200
      -| 300          C
      -|
 
 Note that the entry for the `camelot' BBS is not split.  In the
 original data file (see Data Files for the Examples), the line looks
 like this:
 
      camelot      555-0542     300               C
 
 It only has one baud rate; there are no slashes in the record.
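 
    The same splitting can be tried without the `BBS-list' file.  The
 following is a minimal sketch that feeds two made-up lines (they are
 not taken from the real data file) to the program shown above.  The
 shell variable `out' is only an illustrative name.

```shell
# Two made-up lines: the first contains slashes, the second has none.
out=$(printf 'alpha 2400/1200/300 A\nbeta 300 C\n' |
      awk 'BEGIN { RS = "/" } ; { print $0 }')
printf '%s\n' "$out"
```

 The third record is `300 A', a newline, and `beta 300 C', because the
 newline between the two input lines is no longer a separator.  The
 command substitution drops the trailing blank line that `awk' itself
 would print.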
 
    Another way to change the record separator is on the command line,
 using the variable-assignment feature (see Other Command Line
 Arguments):
 
      awk '{ print $0 }' RS="/" BBS-list
 
 This sets `RS' to `/' before processing `BBS-list'.
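 
    A command-line assignment takes effect only when `awk' reaches that
 argument, which is after any `BEGIN' rule has run.  The sketch below
 illustrates the timing; the scratch file created with `mktemp' is a
 hypothetical stand-in for `BBS-list'.

```shell
# Hypothetical scratch file standing in for BBS-list.
tmp=$(mktemp)
printf 'one/two/three' > "$tmp"

# RS is still a newline while BEGIN runs; the RS="/" argument is
# processed afterwards, just before "$tmp" is opened.
out=$(awk 'BEGIN { if (RS == "\n") print "BEGIN: RS is still a newline" }
           { print $0 }' RS="/" "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

 The `BEGIN' line is printed first, followed by `one', `two' and
 `three' on separate lines.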
 
    Using an unusual character such as `/' for the record separator
 produces correct behavior in the vast majority of cases.  However, the
 following (extreme) pipeline prints a surprising `1'.  There is one
 field, consisting of a newline.  The value of the built-in variable
 `NF' is the number of fields in the current record.
 
      $ echo | awk 'BEGIN { RS = "a" } ; { print NF }'
      -| 1
 
 Reaching the end of an input file terminates the current input record,
 even if the last character in the file is not the character in `RS'
 (d.c.).
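 
    As a small illustration of this end-of-file rule, the input in the
 following sketch ends with `y' rather than with the separator, yet `y'
 is still delivered as the second record.  The input text is made up
 for the example.

```shell
# The input ends with "y", not with the separator "/".
out=$(printf 'x/y' | awk 'BEGIN { RS = "/" } ; { print NR ": " $0 }')
printf '%s\n' "$out"
```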
 
    The empty string, `""' (a string of no characters), has a special
 meaning as the value of `RS': it means that records are separated by
 one or more blank lines, and nothing else.  See Multiple-Line Records,
 for more details.
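 
    A minimal sketch of this mode, using made-up input in which three
 blank lines separate two groups of lines.  It relies on the fact that
 a newline also separates fields when `RS' is the empty string.

```shell
# Lines "a" and "b" form one record; "c" forms another.
out=$(printf 'a\nb\n\n\n\nc\n' |
      awk 'BEGIN { RS = "" }
           { print "record " NR ": " NF " field(s), first is " $1 }')
printf '%s\n' "$out"
```

 The run of blank lines counts as a single separator, so only two
 records are reported.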
 
    If you change the value of `RS' in the middle of an `awk' run, the
 new value is used to delimit subsequent records, but the record
 currently being processed (and records already processed) are not
 affected.
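 
    The following sketch, with made-up input, changes `RS' while the
 first record is being processed.  That record was already delimited
 by a newline and stays as it is; only the later records are split at
 slashes.

```shell
# Record 1 is read with the default separator; the rule then switches
# RS to "/" for everything that follows.
out=$(printf 'a b\nc/d/e' |
      awk 'NR == 1 { RS = "/" } { print NR ": " $0 }')
printf '%s\n' "$out"
```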
 
    After the end of the record has been determined, `gawk' sets the
 variable `RT' to the text in the input that matched `RS'.
 
    The value of `RS' is in fact not limited to a one-character string.
 It can be any regular expression (see Regular Expressions).
 In general, each record ends at the next string that matches the
 regular expression; the next record starts at the end of the matching
 string.  This general rule is actually at work in the usual case, where
 `RS' contains just a newline: a record ends at the beginning of the
 next matching string (the next newline in the input) and the following
 record starts just after the end of this string (at the first character
 of the following line).  The newline, since it matches `RS', is not
 part of either record.
 
    When `RS' is a single character, `RT' contains the same single
 character.  However, when `RS' is a regular expression, `RT' becomes
 more useful; it contains the actual input text that matched the
 regular expression.
 
    The following example illustrates both of these features.  It sets
 `RS' equal to a regular expression that matches either a newline, or a
 series of one or more upper-case letters with optional leading and/or
 trailing white space (see Regular Expressions).
 
      $ echo record 1 AAAA record 2 BBBB record 3 |
      > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
      >             { print "Record =", $0, "and RT =", RT }'
      -| Record = record 1 and RT =  AAAA
      -| Record = record 2 and RT =  BBBB
      -| Record = record 3 and RT =
      -|
 
 The final line of output is followed by an extra blank line.  This is
 because the value of `RT' is a newline, and then the `print' statement
 supplies its own terminating newline.
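 
    A shorter sketch of the same idea, with made-up input that has no
 trailing newline, so that `RT' is empty for the last record.  Since
 `RT' is specific to `gawk', the commands are skipped when `gawk' is
 not installed.

```shell
# RT is a gawk extension, so skip the demonstration if gawk is absent.
if command -v gawk >/dev/null 2>&1; then
    out=$(printf 'a,b;c' |
          gawk 'BEGIN { RS = "[,;]" }
                { printf "[%s] ended by [%s]\n", $0, RT }')
    printf '%s\n' "$out"
fi
```

 Each line shows a record together with the exact text that ended it:
 a comma, a semicolon, and finally nothing, because the last record was
 ended by the end of the input.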
 
    See A Simple Stream Editor, for a more useful example of `RS' as a
 regexp and `RT'.
 
    The use of `RS' as a regular expression and the `RT' variable are
 `gawk' extensions; they are not available in compatibility mode (see
 Command Line Options).  In compatibility mode, only the first
 character of the value of `RS' is used to determine the end of the
 record.
 
    The `awk' utility keeps track of the number of records that have
 been read so far from the current input file.  This value is stored in a
 built-in variable called `FNR'.  It is reset to zero when a new file is
 started.  Another built-in variable, `NR', is the total number of input
 records read so far from all data files.  It starts at zero but is
 never automatically reset to zero.
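 
    The difference between the two counters shows up as soon as there
 is more than one input file.  The sketch below uses two hypothetical
 scratch files, `first' and `second', created in a temporary directory.

```shell
# Two hypothetical two-line input files.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/first"
printf 'c\nd\n' > "$dir/second"

# Print each record followed by FNR (per file) and NR (overall).
out=$(awk '{ print $0, FNR, NR }' "$dir/first" "$dir/second")
printf '%s\n' "$out"
rm -rf "$dir"
```

 For the second file `FNR' starts again at 1, while `NR' continues
 with 3 and 4.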
 